Abstract

In the problem of multivariate regression, a $K$-dimensional response vector is regressed
upon a common set of $p$ covariates, with a matrix $B^* \in \mathbb{R}^{p \times K}$ of regression
coefficients. We study the behavior of the group Lasso using $\ell_1/\ell_2$ regularization for
the union support problem, meaning that the set of $s$ rows for which $B^*$ is non-zero
is recovered exactly. Studying this problem under high-dimensional scaling, we show
that group Lasso recovers the exact row pattern with high probability over the random
design and noise for scalings of $(n, p, s)$ such that the sample complexity parameter
given by $\theta(n, p, s) := n/[2\psi(B^*)\log(p - s)]$ exceeds a critical threshold. Here $n$ is the
sample size, $p$ is the ambient dimension of the regression model, $s$ is the number of
non-zero rows, and $\psi(B^*)$ is a sparsity-overlap function that measures a combination
of the sparsities and overlaps of the $K$ regression coefficient vectors that constitute the
model. This sparsity-overlap function reveals that, if the design is uncorrelated on the
active rows, block $\ell_1/\ell_2$ regularization for multivariate regression never harms
performance relative to an ordinary Lasso approach, and can yield substantial improvements
in sample complexity (up to a factor of $K$) when the regression vectors are suitably
orthogonal. For more general designs, it is possible for the ordinary Lasso to outperform
the group Lasso. We complement our analysis with simulations that demonstrate the
sharpness of our theoretical results, even for relatively small problems.


1  Introduction 

The development of efficient algorithms for large-scale model selection has been a major
goal of statistical learning research in the last decade. There is now a substantial body
of work based on $\ell_1$-regularization, dating back to the seminal work of Tibshirani (1996)
and Donoho and collaborators (Chen et al., 1998; Donoho and Huo, 2001). The bulk of
this work has focused on the standard problem of linear regression, in which one makes
observations of the form

$$ y = X\beta^* + w, \tag{1} $$

where $y \in \mathbb{R}^n$ is a real-valued vector of observations, $w \in \mathbb{R}^n$ is an additive zero-mean noise
vector, and $X \in \mathbb{R}^{n \times p}$ is the design matrix. A subset of the components of the unknown
parameter vector $\beta^* \in \mathbb{R}^p$ are assumed non-zero; the model selection goal is to identify
these coefficients and (possibly) estimate their values. This goal can be formulated in terms
of the solution of a penalized optimization problem:

$$ \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda_n \|\beta\|_0 \right\}, \tag{2} $$

where $\|\beta\|_0$ counts the number of non-zero components in $\beta$ and where $\lambda_n > 0$ is a
regularization parameter. Unfortunately, this optimization problem is computationally intractable,
a fact which has led various authors to consider the convex relaxation (Tibshirani, 1996;
Chen et al., 1998)

$$ \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda_n \|\beta\|_1 \right\}, \tag{3} $$

in which $\|\beta\|_0$ is replaced with the $\ell_1$ norm $\|\beta\|_1$. This relaxation, often referred to as the
Lasso (Tibshirani, 1996), is a quadratic program, and can be solved efficiently by various
methods (e.g., Boyd and Vandenberghe, 2004; Osborne et al., 2000; Efron et al., 2004).

A variety of theoretical results are now in place for the Lasso, both in the traditional
setting where the sample size $n$ tends to infinity with the problem size $p$ fixed (Knight and
Fu, 2000), as well as under high-dimensional scaling, in which $p$ and $n$ tend to infinity
simultaneously, thereby allowing $p$ to be comparable to or even larger than $n$ (e.g., Meinshausen
and Bühlmann, 2006; Wainwright, 2006; Zhao and Yu, 2006). In many applications, it is
natural to impose sparsity constraints on the regression vector $\beta^*$, and a variety of such
constraints have been considered. For example, one can consider a “hard sparsity” model
in which $\beta^*$ is assumed to contain at most $s$ non-zero entries or a “soft sparsity” model in
which $\beta^*$ is assumed to belong to an $\ell_q$ ball with $q < 1$. Analyses also differ in terms of the
loss functions that are considered. For the model or variable selection problem, it is natural
to consider the $\{0\text{-}1\}$-loss associated with the problem of recovering the unknown support
set of $\beta^*$. Alternatively, one can view the Lasso as a shrinkage estimator to be compared
to traditional least squares or ridge regression; in this case, it is natural to study the $\ell_2$-loss
$\|\hat\beta - \beta^*\|_2$ between the estimate $\hat\beta$ and the ground truth. In other settings, the prediction
error $E[(Y - X^T\hat\beta)^2]$ may be of primary interest, and one tries to show risk consistency
(namely, that the estimated model predicts as well as the best sparse model, whether or
not the true model is sparse).

1.1  Block-structured  regularization 

While the assumption of sparsity at the level of individual coefficients is one way to give
meaning to high-dimensional ($p \gg n$) regression, there are other structural assumptions
that are natural in regression, and which may provide additional leverage. For instance,
in a hierarchical regression model, groups of regression coefficients may be required to
be zero or non-zero in a blockwise manner; for example, one might wish to include a
particular covariate and all powers of that covariate as a group (Yuan and Lin, 2006; Zhao
et al., 2007). Another example arises when we consider variable selection in the setting of
multivariate regression: multiple regressions can be related by a (partially) shared sparsity
pattern, such as when there is an underlying set of covariates that are “relevant” across
regressions (Obozinski et al., 2007; Argyriou et al., 2006; Turlach et al., 2005; Zhang et al.,
2008). Based on such motivations, a recent line of research (Bach et al., 2004; Tropp, 2006;
Yuan and Lin, 2006; Zhao et al., 2007; Obozinski et al., 2007; Ravikumar et al., 2008) has
studied the use of block-regularization schemes, in which the $\ell_1$ norm is composed with some
other $\ell_q$ norm ($q > 1$), thereby obtaining the $\ell_1/\ell_q$ norm defined as a sum of $\ell_q$ norms over
groups of regression coefficients. The best known examples of such block norms are the
$\ell_1/\ell_\infty$ norm (Turlach et al., 2005; Zhang et al., 2008), and the $\ell_1/\ell_2$ norm (Obozinski et al.,
2007).

In this paper, we investigate the use of $\ell_1/\ell_2$ block-regularization in the context of
high-dimensional multivariate linear regression, in which a collection of $K$ scalar outputs are
regressed on the same design matrix $X \in \mathbb{R}^{n \times p}$. Representing the regression coefficients as
a $p \times K$ matrix $B^*$, the multivariate regression model takes the form

$$ Y = XB^* + W, \tag{4} $$

where $Y \in \mathbb{R}^{n \times K}$ and $W \in \mathbb{R}^{n \times K}$ are matrices of observations and zero-mean noise
respectively. In addition, we assume a hard-sparsity model for the regression coefficients in which
column $k$ of the coefficient matrix $B^*$ has non-zero entries on a subset

$$ S_k := \{ i \in \{1, \dots, p\} \mid \beta^*_{ik} \neq 0 \} \tag{5} $$

of size $s_k := |S_k|$. We focus on the problem of recovering the union of the supports,
namely the set $S := \bigcup_{k=1}^K S_k$, corresponding to the subset of indices $i \in \{1, \dots, p\}$ that
are involved in at least one regression. This union support problem can be understood as
the generalization of variable selection to the group setting. Rather than selecting specific
components of a coefficient vector, we aim to select specific rows of a coefficient matrix. We
thus also refer to the union support problem as the row selection problem. Note finally that
recovering $S$ is not equivalent to recovering each of the individual supports $S_k$.

If computational complexity were not a concern, the natural way to perform row selection
for $B^*$ would be by solving the optimization problem

$$ \arg\min_{B \in \mathbb{R}^{p \times K}} \left\{ \frac{1}{2n} |||Y - XB|||_F^2 + \lambda_n \|B\|_{\ell_0/\ell_q} \right\}, \tag{6} $$

where $B = (\beta_{ik})_{1 \le i \le p,\, 1 \le k \le K}$ is a $p \times K$ matrix, the quantity $|||\cdot|||_F$ denotes the Frobenius
norm,^ and the “norm” $\|B\|_{\ell_0/\ell_q}$ counts the number of rows in $B$ that have non-zero $\ell_q$ norm.

As before, the $\ell_0$ component of this regularizer yields a non-convex and computationally
intractable problem, so that it is natural to consider the relaxation

$$ \arg\min_{B \in \mathbb{R}^{p \times K}} \left\{ \frac{1}{2n} |||Y - XB|||_F^2 + \lambda_n \|B\|_{\ell_1/\ell_q} \right\}, \tag{7} $$

where $\|B\|_{\ell_1/\ell_q}$ is the block $\ell_1/\ell_q$ norm:

$$ \|B\|_{\ell_1/\ell_q} := \sum_{i=1}^{p} \Big( \sum_{j=1}^{K} |\beta_{ij}|^q \Big)^{1/q} = \sum_{i=1}^{p} \|\beta_i\|_q. \tag{8} $$

^The Frobenius norm of a matrix $A$ is given by $|||A|||_F := \sqrt{\sum_{i,j} A_{ij}^2}$.

The relaxation (7) is a natural generalization of the Lasso; indeed, it specializes to the
Lasso in the case $K = 1$. For later reference, we also note that setting $q = 1$ leads to
the use of the $\ell_1/\ell_1$ block-norm in the relaxation (7). Since this norm decouples across
both the rows and columns, this particular choice is equivalent to solving $K$ separate Lasso
problems, one for each column of the $p \times K$ regression matrix $B^*$. A more interesting choice
is $q = 2$, which yields a block $\ell_1/\ell_2$ norm that couples together the columns of $B$. This
regularization is commonly referred to as the group Lasso. As we discuss in Appendix 2,
the group Lasso with $q = 2$ can be cast as a second-order cone program (SOCP), a family
of optimization problems that can be solved efficiently with interior point methods (Boyd
and Vandenberghe, 2004), and includes quadratic programs as a particular case.
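
As a hedged illustration of the decoupling just described, the following Python sketch compares the proximal operators of the two penalties. Proximal operators are not used in the text, which treats (7) as an SOCP; they are brought in here only as a standard device, and the function names and the example matrix `B` are illustrative assumptions. For $q = 1$ every entry is soft-thresholded on its own, whereas for $q = 2$ each row is shrunk or removed as a unit.

```python
import numpy as np

def prox_l1_l1(B, t):
    """Entrywise soft-thresholding: prox of t * sum_{i,k} |B_ik| (the q = 1 block norm)."""
    return np.sign(B) * np.maximum(np.abs(B) - t, 0.0)

def prox_l1_l2(B, t):
    """Row-wise shrinkage: prox of t * sum_i ||B_i||_2 (the q = 2 block norm)."""
    row_norms = np.linalg.norm(B, axis=1, keepdims=True)
    # Rows with norm <= t are zeroed as a whole; other rows are scaled toward zero.
    scale = np.maximum(1.0 - t / np.maximum(row_norms, 1e-300), 0.0)
    return scale * B

# Hypothetical 3 x 2 coefficient matrix, chosen only for illustration.
B = np.array([[3.0, 0.2],
              [0.3, 0.3],
              [0.0, 2.0]])
print(prox_l1_l1(B, 0.5))  # entries are thresholded independently
print(prox_l1_l2(B, 0.5))  # each row survives or vanishes as a unit
```

With a single column ($K = 1$) the two operators coincide, which mirrors the statement that (7) specializes to the Lasso in that case.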

Some recent work has addressed certain statistical aspects of block-regularization schemes.
Meier et al. (2008) have performed an analysis of risk consistency with block-norm
regularization. Bach (2008) provides an analysis of block-wise support recovery for the kernelized
group-Lasso in the classical, fixed $p$ setting. In the high-dimensional setting, Ravikumar
et al. (2008) have studied the consistency of block-wise support recovery for the group-Lasso
($\ell_1/\ell_2$) for fixed design matrices, and their result is generalized by Liu and Zhang (2008)
to block-wise support recovery in the setting of general $\ell_1/\ell_q$ regularization, again for fixed
design matrices. However, these analyses do not discriminate between various values of $q$,
yielding the same qualitative results and the same convergence rates for $q = 1$ as for $q > 1$.
Our focus, which is motivated by the empirical observation that the group Lasso can
outperform the ordinary Lasso (Bach, 2008; Yuan and Lin, 2006; Zhao et al., 2007; Obozinski
et al., 2007), is precisely the distinction between $q = 1$ and $q > 1$ (specifically $q = 2$).

The distinction between $q = 1$ and $q = 2$ is also significant from an optimization-theoretic
point of view. In particular, the SOCP relaxations underlying the group Lasso ($q = 2$)
are generally tighter than the quadratic programming relaxation underlying the Lasso
($q = 1$); however, the improved accuracy is generally obtained at a higher computational
cost  (Boyd  and  Vandenberghe,  2004).  Thus  we  can  view  our  problem  as  an  instance  of 
the  general  question  of  the  relationship  of  statistical  efficiency  to  computational  efficiency: 
does  the  qualitatively  greater  amount  of  computational  effort  involved  in  solving  the  group 
Lasso  always  yield  greater  statistical  efficiency?  More  specifically,  can  we  give  theoretical 
conditions  under  which  solving  the  generalized  Lasso  problem  (7)  has  greater  statistical 
efficiency  than  naive  strategies  based  on  the  ordinary  Lasso?  Conversely,  can  the  group 
Lasso  ever  be  worse  than  the  ordinary  Lasso? 

With this motivation, this paper provides a detailed analysis of model selection consistency
of the group Lasso (7) with $\ell_1/\ell_2$-regularization. Statistical efficiency is defined in
terms of the scaling of the sample size $n$, as a function of the problem size $p$ and sparsity
structure of the regression matrix $B^*$, required for consistent row selection. Our analysis
is high-dimensional in nature, allowing both $n$ and $p$ to diverge, and yielding explicit error
bounds as a function of $p$. As detailed below, our analysis provides affirmative answers to
both of the questions above. First, we demonstrate that under certain structural assumptions
on the design and regression matrix $B^*$, the group $\ell_1/\ell_2$-Lasso is always guaranteed
to out-perform the ordinary Lasso, in that it correctly performs row selection for sample
sizes for which the Lasso fails with high probability. Second, we also exhibit some problems
(though arguably not generic) for which the group Lasso will be outperformed by the naive
strategy of applying the Lasso separately to each of the $K$ columns, and taking the union
of supports.


1.2  Our  results 


The main contribution of this paper is to show that under certain technical conditions on
the design and noise matrices, the model selection performance of block-regularized $\ell_1/\ell_2$
regression (7) is governed by the sample complexity function

$$ \theta_{\ell_1/\ell_2}(n, p; B^*) := \frac{n}{2\psi(B^*)\log(p - s)}, \tag{9} $$

where $n$ is the sample size, $p$ is the ambient dimension, $s = |S|$ is the number of rows that are
non-zero, and $\psi(\cdot)$ is a sparsity-overlap function. Our use of the term “sample complexity”
for $\theta_{\ell_1/\ell_2}$ reflects the role it plays in our analysis as the rate at which the sample size must
grow in order to obtain consistent row selection as a function of the problem parameters.
More precisely, for scalings $(n, p, s, B^*)$ such that $\theta_{\ell_1/\ell_2}(n, p; B^*)$ exceeds a fixed critical
threshold $t^* \in (0, +\infty)$, we show the probability of correct row selection by $\ell_1/\ell_2$ group
Lasso converges to one.

Whereas the ratio $n/\log p$ is standard for high-dimensional theory on $\ell_1$-regularization, the
function $\psi(B^*)$ is a novel and interesting quantity, which measures both the sparsity of the
matrix $B^*$, as well as the overlap between the different regression tasks, represented by the
columns of $B^*$. (See equation (15) for the precise definition of $\psi(B^*)$.) As a particular
illustration, consider the special case of a single-task or univariate regression with $K = 1$,
in which the convex program (7) reduces to the ordinary Lasso (3). In this case, if the
design matrix is drawn from the Standard Gaussian ensemble (i.e., $X_{ij} \sim N(0, 1)$, i.i.d.),
we show that the sparsity-overlap function reduces to $\psi(B^*) = s$, corresponding to the
support size of the single coefficient vector. We thus recover as a corollary a previously
known result (Wainwright, 2006): namely, the Lasso succeeds in performing exact support
recovery once the ratio $n/[s\log(p - s)]$ exceeds a certain critical threshold. At the other
extreme, for a genuinely multivariate problem with $K > 1$ and $s$ non-zero rows, again for
a Standard Gaussian design, when the regression matrix is “suitably orthonormal” relative
to the design (see Section 2 for a precise definition), the sparsity-overlap function is given
by $\psi(B^*) = s/K$. In this case, $\ell_1/\ell_2$ block-regularization has sample complexity lower by
a factor of $K$ relative to the naive approach of solving $K$ separate Lasso problems. Of
course, there is also a range of behavior between these two extremes, in which the gain
in sample complexity varies smoothly as a function of the sparsity-overlap $\psi(B^*)$ in the
interval $[\frac{s}{K}, s]$. On the other hand, we also show that for suitably correlated designs, it
is possible that the sample complexity $\psi(B^*)$ associated with $\ell_1/\ell_2$ row selection is larger
than that of the ordinary Lasso ($\ell_1/\ell_1$) approach.

The remainder of the paper is organized as follows. In Section 2, we provide a precise
statement of our main result, discuss some of its consequences, and illustrate the close
agreement between our theory and simulations. Section 3 is devoted to the proof of this
main result, with the argument broken down into a series of steps. Technical results are
deferred to the appendix. We conclude with a brief discussion in Section 4.

1.3 Notation

We collect here some notation used throughout the paper. For a (possibly random) matrix
$M \in \mathbb{R}^{p \times K}$, we define the Frobenius norm $|||M|||_F := (\sum_{i,j} m_{ij}^2)^{1/2}$, and for parameters
$1 \le a \le b \le \infty$, the $\ell_a/\ell_b$ block norm

$$ \|M\|_{\ell_a/\ell_b} := \Big\{ \sum_{i=1}^{p} \Big( \sum_{k=1}^{K} |m_{ik}|^b \Big)^{a/b} \Big\}^{1/a}. \tag{10} $$

These vector norms on matrices should be distinguished from the $(a, b)$-operator norms

$$ |||M|||_{a,b} := \sup_{\|x\|_b = 1} \|Mx\|_a, \tag{11} $$

although some norms belong to both families; see Lemma 5 in Appendix B. Important
special cases of the latter include the spectral norm $|||M|||_{2,2}$ (also denoted $|||M|||_2$), and the
$\ell_\infty$-operator norm $|||M|||_{\infty,\infty} = \max_i \sum_k |m_{ik}|$, denoted $|||M|||_\infty$ for short.
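
To make the two families of norms concrete, here is a minimal NumPy sketch. The function names are illustrative assumptions rather than notation from the paper; the block norm takes the $\ell_b$ norm of each row and then the $\ell_a$ norm of the resulting vector, while the operator norms shown are the two special cases named above.

```python
import numpy as np

def block_norm(M, a, b):
    """l_a/l_b block norm: l_b norm of every row, then l_a norm of the row norms."""
    row_norms = np.linalg.norm(M, ord=b, axis=1)
    return np.linalg.norm(row_norms, ord=a)

def spectral_norm(M):
    """(2, 2)-operator norm: the largest singular value of M."""
    return np.linalg.norm(M, ord=2)

def linf_operator_norm(M):
    """(inf, inf)-operator norm: the maximum absolute row sum of M."""
    return np.max(np.sum(np.abs(M), axis=1))

# Small hypothetical matrix with one all-zero row.
M = np.array([[3.0, -4.0],
              [0.0,  0.0],
              [1.0,  0.0]])
print(block_norm(M, 1, 2))     # sum of row l2 norms (the group Lasso penalty)
print(block_norm(M, 2, 2))     # coincides with the Frobenius norm
print(linf_operator_norm(M))
```

Setting `a = 1` and `b = 2` gives the $\ell_1/\ell_2$ penalty, and `a = b = 2` recovers the Frobenius norm.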

2  Main  result  and  some  consequences 

The analysis of this paper applies to random ensembles of multivariate linear regression
problems, each of the form (4), where the noise matrix $W \in \mathbb{R}^{n \times K}$ is assumed to consist of
i.i.d. elements $W_{ij} \sim N(0, \sigma^2)$. We consider random design matrices $X$ with each row drawn
in an i.i.d. manner from a zero-mean Gaussian $N(0, \Sigma)$, where $\Sigma \succ 0$ is a $p \times p$ covariance
matrix. We note in passing that analogs of our results with different constants apply to any
design with sub-Gaussian rows.^ Although the block-regularized problem (7) need not have
a unique solution in general, a consequence of our analysis is that in the regime of interest,
the solution is unique, so that we may talk unambiguously about the estimated support $\hat{S}$.
The main object of study in this paper is the probability $P[\hat{S} = S]$, where the probability
is taken both over the random choice of noise matrix $W$ and random design matrix $X$. We
study the behavior of this probability as elements of the triplet $(n, p, s)$ tend to infinity.

2.1 Notation and assumptions

More precisely, our main result applies to sequences of models indexed by $(n, p(n), s(n))$, an
associated sequence of $p \times p$ covariance matrices, and a sequence $\{B^*\}$ of coefficient matrices
with row support

$$ S := \{ i \mid \beta_i^* \neq 0 \} \tag{12} $$

of size $|S| = s = s(n)$. We use $S^c$ to denote its complement (i.e., $S^c := \{1, \dots, p\} \setminus S$). We
let

$$ b^*_{\min} := \min_{i \in S} \|\beta_i^*\|_2 \tag{13} $$

correspond to the minimal $\ell_2$ row-norm of the coefficient matrix $B^*$ over its non-zero rows.
We impose the following conditions on the covariance $\Sigma$ of the design matrix:

^See Buldygin and Kozachenko (2000) for more details on sub-Gaussian random vectors.

(A1) Bounded eigenspectrum: There exist fixed constants $C_{\min} > 0$ and $C_{\max} < +\infty$
such that all eigenvalues of the $s \times s$ matrix $\Sigma_{SS}$ are contained in the interval
$[C_{\min}, C_{\max}]$.

(A2) Mutual incoherence: There exists a fixed incoherence parameter $\gamma \in (0, 1]$ such
that

$$ |||\Sigma_{S^c S} (\Sigma_{SS})^{-1}|||_\infty \le 1 - \gamma. $$

(A3) Self-incoherence: There exists $D_{\max} < +\infty$ such that $|||(\Sigma_{SS})^{-1}|||_\infty \le D_{\max}$.

Assumption A1 prevents excess dependence among elements of the design matrix associated
with the support $S$; conditions of this form are required for model selection consistency
or $\ell_2$ consistency of the Lasso. The mutual incoherence and self-incoherence
assumptions are also well known from previous work on variable selection consistency of the
Lasso (Meinshausen and Bühlmann, 2006; Tropp, 2006; Zhao and Yu, 2006). Although such
incoherence assumptions are not needed in analyzing $\ell_2$ or risk consistency, they are known
to be necessary for model selection consistency of the Lasso. Indeed, in the absence of such
conditions, it is always possible to make the Lasso fail, even with an arbitrarily large sample
size. (However, see Meinshausen and Yu (2008) for methods that weaken the incoherence
condition.) Note that these assumptions are trivially satisfied by the standard Gaussian
ensemble $\Sigma = I_{p \times p}$, with $C_{\min} = C_{\max} = 1$, $D_{\max} = 1$, and $\gamma = 1$. More generally, it can be
shown that various matrix classes (e.g., Toeplitz matrices, tree-structured covariance
matrices, bounded off-diagonal matrices) satisfy these conditions (Meinshausen and Bühlmann,
2006; Zhao and Yu, 2006; Wainwright, 2006).
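
The constants in (A1) through (A3) can be computed directly for a given covariance and support. The sketch below is one possible way to do so in NumPy; the function names and the AR(1)-type Toeplitz covariance $\Sigma_{ij} = \rho^{|i-j|}$ are illustrative assumptions, not objects defined in the paper. The returned `gamma` is the largest value for which (A2) holds, so a non-positive value indicates that (A2) fails for that support.

```python
import numpy as np

def linf_operator_norm(M):
    """Maximum absolute row sum, i.e. the l_inf operator norm (0 for an empty matrix)."""
    return float(np.max(np.sum(np.abs(M), axis=1))) if M.size else 0.0

def design_constants(Sigma, S):
    """Return (C_min, C_max, gamma, D_max) for covariance Sigma and support index list S."""
    S = np.asarray(S)
    Sc = np.setdiff1d(np.arange(Sigma.shape[0]), S)
    Sigma_SS = Sigma[np.ix_(S, S)]
    inv_SS = np.linalg.inv(Sigma_SS)
    eigs = np.linalg.eigvalsh(Sigma_SS)            # (A1): eigenvalue range of Sigma_SS
    # (A2): largest gamma with |||Sigma_{S^c S} Sigma_SS^{-1}|||_inf <= 1 - gamma.
    gamma = 1.0 - linf_operator_norm(Sigma[np.ix_(Sc, S)] @ inv_SS)
    return float(eigs.min()), float(eigs.max()), gamma, linf_operator_norm(inv_SS)

# Illustrative covariances: identity and a Toeplitz matrix with rho = 0.5.
p, rho = 6, 0.5
idx = np.arange(p)
Sigma_toeplitz = rho ** np.abs(idx[:, None] - idx[None, :])
print(design_constants(np.eye(p), [0, 1, 2]))
print(design_constants(Sigma_toeplitz, [0, 1, 2]))
```

For the identity covariance all four constants equal one, matching the remark about the standard Gaussian ensemble.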

2.2  Statement  of  main  result 

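We require a few pieces of notation before stating the main result. For an arbitrary matrix
$B_S \in \mathbb{R}^{s \times K}$ with $i$-th row $\beta_i \in \mathbb{R}^{1 \times K}$, we define the matrix $\zeta(B_S) \in \mathbb{R}^{s \times K}$ with $i$-th row

$$ \zeta(\beta_i) := \frac{\beta_i}{\|\beta_i\|_2}. \tag{14} $$

With this notation, the sparsity-overlap function is given by

$$ \psi(B) := |||\zeta(B_S)^T (\Sigma_{SS})^{-1} \zeta(B_S)|||_2, \tag{15} $$

where $|||\cdot|||_2$ denotes the spectral norm. Finally, the sample complexity function is given by

$$ \theta_{\ell_1/\ell_2}(n, p; B^*) := \frac{n}{2\psi(B^*)\log(p - s)}. \tag{16} $$

With this setup, we have the following result: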
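
The definitions (14) through (16) translate almost line by line into NumPy. The following is a minimal sketch under stated assumptions: the function names are hypothetical, `B_S` holds only the active rows, and an all-zero row (which (14) does not cover) is mapped to zero by convention.

```python
import numpy as np

def zeta(B_S):
    """Scale each row of B_S to unit l2 norm, as in (14); all-zero rows are left at zero."""
    B_S = np.asarray(B_S, dtype=float)
    norms = np.linalg.norm(B_S, axis=1, keepdims=True)
    return np.divide(B_S, norms, out=np.zeros_like(B_S), where=norms > 0)

def sparsity_overlap(B_S, Sigma_SS):
    """psi(B) from (15): spectral norm of zeta(B_S)^T Sigma_SS^{-1} zeta(B_S)."""
    Z = zeta(B_S)
    return np.linalg.norm(Z.T @ np.linalg.solve(Sigma_SS, Z), ord=2)

def sample_complexity(n, p, B_S, Sigma_SS):
    """theta(n, p; B) from (16), with s taken as the number of rows of B_S."""
    s = np.asarray(B_S).shape[0]
    return n / (2.0 * sparsity_overlap(B_S, Sigma_SS) * np.log(p - s))

# Univariate case (K = 1) with an identity design: psi equals the support size s = 3.
B1 = np.array([[1.0], [-2.0], [0.5]])
print(sparsity_overlap(B1, np.eye(3)))
print(sample_complexity(60, 13, B1, np.eye(3)))
```

The univariate example reproduces the reduction $\psi(B^*) = s$ stated for the standard Gaussian ensemble.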

Theorem 1. Consider a random design matrix $X$ drawn with i.i.d. $N(0, \Sigma)$ row vectors,
where $\Sigma$ satisfies assumptions A1 through A3, and an observation matrix $Y$ specified by
model (4). Suppose that the squared minimum value $(b^*_{\min})^2$ decays no more slowly than
$f(p) \min\{\frac{1}{s}, \frac{1}{\log(p - s)}\}$ for some function $f(p)$ such that $f(p)/s \to 0$ and $f(p) \to +\infty$. Then for all
sequences $(n, p, B^*)$ such that

$$ \theta_{\ell_1/\ell_2}(n, p; B^*) = \frac{n}{2\psi(B^*)\log(p - s)} > t^*(\Sigma) := \frac{C_{\max}}{\gamma^2}, \tag{17} $$

we have with probability greater than $1 - c_1 \exp(-c_2 \log s)$:

(a) the SOCP (7) with $\lambda_n = \sqrt{f(p)\log(p)/n}$ has a unique solution $\hat{B}$, and

(b) the row support set

$$ \hat{S} = S(\hat{B}) := \{ i \mid \hat{\beta}_i \neq 0 \} \tag{18} $$

specified by this unique solution is equal to the row support set $S(B^*)$ of the true model.
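
For readers who want to experiment with the estimator in the theorem, the sketch below solves the block-regularized problem (7) with $q = 2$ and reads off the row support as in (18). It deliberately swaps in proximal gradient descent instead of the SOCP formulation used in the paper; the function names, the iteration count, and the choice of `lam` are assumptions made for illustration only.

```python
import numpy as np

def group_lasso(X, Y, lam, n_iter=500):
    """Proximal gradient for (1/2n)|||Y - XB|||_F^2 + lam * sum_i ||B_i||_2."""
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    step = n / np.linalg.norm(X, ord=2) ** 2      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        G = B - step * (X.T @ (X @ B - Y)) / n    # gradient step on the quadratic loss
        norms = np.linalg.norm(G, axis=1, keepdims=True)
        # Block soft-thresholding: rows with norm <= step * lam become exactly zero.
        B = np.maximum(1.0 - step * lam / np.maximum(norms, 1e-300), 0.0) * G
    return B

def row_support(B):
    """Indices of the rows of B that are not identically zero."""
    return np.flatnonzero(np.linalg.norm(B, axis=1) > 0)
```

A typical use would draw `X` with i.i.d. Gaussian rows, form `Y = X @ B_star + W`, and compare `row_support(group_lasso(X, Y, lam))` with the true support while varying `n`.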


2.3  Some  consequences  of  Theorem  1 

We  begin  by  making  some  simple  observations  about  the  sparsity  overlap  function. 


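Lemma 1. (a) For any design satisfying assumption A1, the sparsity-overlap $\psi(B^*)$ obeys
the bounds

$$ \frac{s}{C_{\max} K} \le \psi(B^*) \le \frac{s}{C_{\min}}. \tag{19} $$

(b) If $\Sigma_{SS} = I_{s \times s}$, and if the columns $(Z^{(k)*})$ of the matrix $Z^* = \zeta(B^*)$ are orthogonal,
then the sparsity overlap function is $\psi(B^*) = \max_{k=1,\dots,K} \|Z^{(k)*}\|_2^2$.

Proof. (a) To verify this claim, we first set $Z_S^* = \zeta(B_S^*)$, and use $Z_S^{(k)*}$ to denote the $k$-th
column of $Z_S^*$. Since the spectral norm is upper bounded by the sum of eigenvalues, and
lower bounded by the average eigenvalue, we have

$$ \frac{1}{K} \operatorname{tr}(Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*) \le \psi(B^*) \le \operatorname{tr}(Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*). $$

Given our assumption (A1) on $\Sigma_{SS}$, we have

$$ \operatorname{tr}(Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*) = \sum_{k=1}^{K} Z_S^{(k)*T} \Sigma_{SS}^{-1} Z_S^{(k)*} \ge \frac{1}{C_{\max}} \sum_{k=1}^{K} \|Z_S^{(k)*}\|_2^2 = \frac{s}{C_{\max}}, $$

using the fact that $\sum_{k=1}^{K} \|Z_S^{(k)*}\|_2^2 = \sum_{i=1}^{s} \|Z_i^*\|_2^2 = s$. Similarly, in the other direction, we
have

$$ \operatorname{tr}(Z_S^{*T} \Sigma_{SS}^{-1} Z_S^*) = \sum_{k=1}^{K} Z_S^{(k)*T} \Sigma_{SS}^{-1} Z_S^{(k)*} \le \frac{1}{C_{\min}} \sum_{k=1}^{K} \|Z_S^{(k)*}\|_2^2 = \frac{s}{C_{\min}}, $$

which completes the proof.

(b) Under the assumed orthogonality, the matrix $Z^{*T} Z^*$ is diagonal with $\|Z^{(k)*}\|_2^2$ as
the diagonal elements, so that the largest $\|Z^{(k)*}\|_2^2$ is then the largest eigenvalue of the
matrix. □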
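
The bounds in Lemma 1(a) are easy to check numerically. The sketch below does so for one randomly generated instance; the helper name, the seed, and the way the positive definite block `Sigma_SS` is built are all illustrative assumptions, and every row of `B_S` is assumed to be non-zero.

```python
import numpy as np

def lemma1_check(B_S, Sigma_SS):
    """Return (lower, psi, upper) with lower = s / (C_max K) and upper = s / C_min."""
    s, K = B_S.shape
    Z = B_S / np.linalg.norm(B_S, axis=1, keepdims=True)   # zeta(B_S), rows assumed non-zero
    psi = np.linalg.norm(Z.T @ np.linalg.solve(Sigma_SS, Z), ord=2)
    eigs = np.linalg.eigvalsh(Sigma_SS)                     # C_min and C_max of this block
    return s / (eigs.max() * K), psi, s / eigs.min()

rng = np.random.default_rng(0)             # arbitrary seed
B_S = rng.standard_normal((8, 3))          # hypothetical active rows: s = 8, K = 3
A = rng.standard_normal((8, 8))
Sigma_SS = A @ A.T / 8 + np.eye(8)         # a positive definite covariance block
lower, psi, upper = lemma1_check(B_S, Sigma_SS)
print(lower, psi, upper)
```

The same helper can be used for part (b): with an identity block and orthogonal columns of $\zeta(B_S)$, the returned `psi` equals the largest squared column norm.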


Based on this lemma, we now study some special cases of Theorem 1. The simplest
special case is the univariate regression problem ($K = 1$), in which case the quantity $\zeta(\beta^*)$
(as defined in equation (14)) simply outputs an $s$-dimensional sign vector with elements
$z_i^* = \operatorname{sign}(\beta_i^*)$. (Recall that the sign function is defined as $\operatorname{sign}(0) = 0$, $\operatorname{sign}(x) = 1$ if
$x > 0$ and $\operatorname{sign}(x) = -1$ if $x < 0$.) In this case, the sparsity overlap function is given by
$\psi(\beta^*) = z^{*T} (\Sigma_{SS})^{-1} z^*$, and as a consequence of Lemma 1(a), we have $\psi(\beta^*) = \Theta(s)$.
Consequently, a simple corollary of Theorem 1 is that the Lasso succeeds once the ratio
$n/(2s\log(p - s))$ exceeds a certain critical threshold, determined by the eigenspectrum and
incoherence properties of $\Sigma$. This result matches the necessary and sufficient conditions
established in previous work on the Lasso (Wainwright, 2006).

We  can  also  use  Lemma  1  and  Theorem  1  to  compare  the  performance  of  the  group 
Lasso  to  the  following  (arguably  naive)  strategy  for  row  selection  using  the  ordinary  Lasso: 

Row  selection  using  ordinary  Lasso: 

1. Apply the ordinary Lasso separately to each of the $K$ univariate regression problems
specified by the columns of $B^*$, thereby obtaining estimates $\hat{\beta}^{(k)}$ for $k = 1, \dots, K$.

2. For $k = 1, \dots, K$, estimate the column support via $\hat{S}_k := \{ i \mid \hat{\beta}_i^{(k)} \neq 0 \}$.

3. Estimate the row support by taking the union: $\hat{S} = \bigcup_{k=1}^{K} \hat{S}_k$.
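
The three steps above can be sketched in a few lines of Python. The Lasso solver used here is plain iterative soft-thresholding (ISTA), chosen only because it is short; the paper does not prescribe a solver, and the function names, iteration count, and common regularization level `lam` are assumptions for illustration.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Iterative soft-thresholding for (1/2n)||y - X beta||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, ord=2) ** 2      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        g = beta - step * (X.T @ (X @ beta - y)) / n
        beta = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)
    return beta

def union_support_lasso(X, Y, lam):
    """Steps 1-3: one Lasso per column of Y, then the union of the estimated supports."""
    supports = [set(np.flatnonzero(lasso_ista(X, Y[:, k], lam)).tolist())
                for k in range(Y.shape[1])]
    return sorted(set().union(*supports))
```

Because each column is fitted on its own, this procedure ignores any structure shared across the $K$ regressions, which is exactly what the comparison with the group Lasso is meant to probe.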

To understand the conditions governing the success/failure of this procedure, note that it
succeeds if and only if for each non-zero row $i \in S = \bigcup_{k=1}^{K} S_k$, the variable $\hat{\beta}_i^{(k)}$ is non-zero for
at least one $k$, and for all $j \in S^c = \{1, \dots, p\} \setminus S$, the variable $\hat{\beta}_j^{(k)} = 0$ for all $k = 1, \dots, K$.
From our understanding of the univariate case, we know that for $C = 2t^*(\Sigma)$, the condition

$$ n > C \max_{k=1,\dots,K} \psi(\beta_S^{(k)*}) \log(p - s_k) \ge C \max_{k=1,\dots,K} \psi(\beta_S^{(k)*}) \log(p - s) \tag{20} $$

is sufficient to ensure that the ordinary Lasso succeeds in row selection. Conversely, if
$n < \max_{k=1,\dots,K} \psi(\beta_S^{(k)*}) \log(p - s)$, then there will exist some $j \in S^c$ such that for at least one
$k \in \{1, \dots, K\}$, there holds $\hat{\beta}_j^{(k)} \neq 0$ with high probability, implying failure of the ordinary
Lasso.

A  natural  question  is  whether  the  group  Lasso,  by  taking  into  account  the  couplings 
across  columns,  always  outperforms  (or  at  least  matches)  this  naive  strategy.  The  following 
result  shows  that  if  the  design  is  uncorrelated  on  its  support,  then  indeed  this  is  the  case. 

Corollary 1 (Group Lasso versus ordinary Lasso). Assume that $\Sigma_{SS} = I_{s \times s}$. Then for any
multivariate regression problem, row selection using the ordinary Lasso strategy requires,
with high probability, at least as many samples as the $\ell_1/\ell_2$ group Lasso. In particular, the
relative efficiency of group Lasso versus ordinary Lasso is given by the ratio

$$ \frac{\max_{k=1,\dots,K} \psi(\beta_S^{(k)*}) \log(p - s_k)}{\psi(B_S^*) \log(p - s)} \ge 1. \tag{21} $$

Proof. From our discussion preceding the statement of Corollary 1, we know that the
quantity

$$ \max_{k=1,\dots,K} \psi(\beta_S^{(k)*}) \log(p - s_k) = \max_{k=1,\dots,K} s_k \log(p - s_k) \ge \max_{k=1,\dots,K} s_k \log(p - s) $$

governs the performance of the ordinary Lasso procedure for row selection. It remains to
show then that $\psi(B_S^*) \le \max_k s_k$.

As before, we use the notation $Z_S^* = \zeta(B_S^*)$, and $Z_i^*$ for the $i$-th row of $Z_S^*$. Since
$\Sigma_{SS} = I_{s \times s}$, we have $\psi(B^*) = |||Z_S^*|||_2^2$. Consequently, by the variational representation of
the $\ell_2$-norm, we have

$$ \psi(B^*) = \max_{x \in \mathbb{R}^K : \|x\| \le 1} \|Z_S^* x\|^2 \le \max_{x \in \mathbb{R}^K : \|x\| \le 1} \sum_{i=1}^{s} (Z_i^{*T} x)^2. $$

Let $|Z_i^*| = (|Z_{i1}^*|, \dots, |Z_{iK}^*|)^T$ and $y_i = (x_1 \operatorname{sign}(Z_{i1}^*), \dots, x_K \operatorname{sign}(Z_{iK}^*))^T$. By the
Cauchy-Schwarz inequality,

$$ (Z_i^{*T} x)^2 = (|Z_i^*|^T y_i)^2 \le \|Z_i^*\|^2 \|y_i\|^2 = \|Z_i^*\|^2 \sum_{k=1}^{K} x_k^2 \operatorname{sign}(Z_{ik}^*)^2, $$

so that, since every row of $Z_S^*$ has unit norm, if $\|x\| \le 1$, we have

$$ \sum_{i=1}^{s} (Z_i^{*T} x)^2 \le \sum_{i=1}^{s} \sum_{k=1}^{K} x_k^2 \operatorname{sign}(Z_{ik}^*)^2 = \sum_{k=1}^{K} x_k^2 \sum_{i=1}^{s} \operatorname{sign}(Z_{ik}^*)^2 = \sum_{k=1}^{K} x_k^2 s_k \le \max_{1 \le k \le K} s_k, $$

thereby establishing the claim. □

We  illustrate  Corollary  1  by  considering  some  special  cases: 

Example 1 (Identical regressions). Suppose that $B^* := \beta^* \vec{1}_K^T$, that is, $B^*$ consists
of $K$ copies of the same coefficient vector $\beta^* \in \mathbb{R}^p$, with support of cardinality $|S| = s$.
We then have $[\zeta(B^*)]_{ij} = \operatorname{sign}(\beta_i^*)/\sqrt{K}$, from which we see that $\psi(B^*) = z^{*T} (\Sigma_{SS})^{-1} z^*$
with $z^*$ being an $s$-dimensional sign vector with elements $z_i^* = \operatorname{sign}(\beta_i^*)$. Consequently, we
have the equality $\psi(B^*) = \psi(\beta^{(1)*})$, so that there is no benefit in using the group Lasso
relative to the strategy of solving separate Lasso problems and constructing the union
of individually estimated supports. This fact might seem rather pessimistic, since under
model (4), we essentially have $Kn$ observations of the coefficient vector $\beta^*$ with the same
design matrix but $K$ independent noise realizations. However, under the given conditions,
the rates of convergence for model selection in high-dimensional results such as Theorem 1
are determined by the number of interfering variables, $p - s$, as opposed to the noise variance.

In  contrast  to  this  pessimistic  example,  we  now  turn  to  the  most  optimistic  extreme: 

Example 2 (“Orthonormal” regressions). Suppose that $\Sigma_{SS} = I_{s \times s}$ and (for $s > K$)
suppose that $B^*$ is constructed such that the columns of the $s \times K$ matrix $\zeta(B^*)$ are all
orthogonal and of equal norm (that is, orthonormal up to a common scaling). Under these
conditions, we claim that the sample complexity of group Lasso
is lower than that of the ordinary Lasso by a factor of $1/K$. Indeed, by Lemma 1(b) and
the equal column norms, we observe that

$$ K\psi(B^*) = \sum_{k=1}^{K} \|Z^{(k)*}\|_2^2 = \operatorname{tr}(Z^{*T} Z^*) = \operatorname{tr}(Z^* Z^{*T}) = s, $$

because $Z^* Z^{*T} \in \mathbb{R}^{s \times s}$ is the Gram matrix of $s$ unit vectors of $\mathbb{R}^K$ and its diagonal elements
are therefore all equal to 1. Consequently, the group Lasso recovers the row support with
high probability for sequences such that

$$ \frac{n}{2 \frac{s}{K} \log(p - s)} > 1, $$

which allows for sample sizes $1/K$ smaller than the ordinary Lasso approach.
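
The two extremes in Examples 1 and 2 can be reproduced numerically with an identity design block. In the sketch below the helper name and both coefficient matrices are hypothetical, and the second matrix is built so that the columns of its row-normalized version are orthogonal with equal norms.

```python
import numpy as np

def sparsity_overlap(B_S, Sigma_SS):
    """psi(B): spectral norm of zeta(B_S)^T Sigma_SS^{-1} zeta(B_S); rows assumed non-zero."""
    Z = B_S / np.linalg.norm(B_S, axis=1, keepdims=True)
    return np.linalg.norm(Z.T @ np.linalg.solve(Sigma_SS, Z), ord=2)

s, K = 4, 2
identity = np.eye(s)

# Example 1: K identical copies of one coefficient vector.
beta = np.array([1.0, -2.0, 3.0, 0.5])
B_identical = np.outer(beta, np.ones(K))

# Example 2: after row normalization the columns are orthogonal with equal norms.
B_orthogonal = np.array([[1.0, 1.0], [1.0, -1.0], [2.0, 2.0], [3.0, -3.0]])

psi_identical = sparsity_overlap(B_identical, identity)     # no gain over the Lasso
psi_orthogonal = sparsity_overlap(B_orthogonal, identity)   # gain by a factor of K
print(psi_identical, psi_orthogonal)
```

Here the first value equals $s$ and the second equals $s/K$, the two endpoints of the range discussed in the examples.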

Corollary 1 and the subsequent examples address the case of uncorrelated design ($\Sigma_{SS} = I_{s \times s}$) on the row support $S$, for which the group Lasso is never worse than the ordinary Lasso in performing row selection. The following result shows that if the supports are disjoint, the ordinary Lasso has the same sample complexity as the group Lasso for uncorrelated design $\Sigma_{SS} = I_{s \times s}$, but can be better than the group Lasso for designs $\Sigma_{SS}$ with suitable correlations:


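Corollary 2 (Disjoint supports). Suppose that the support sets $S_k$ of individual regression problems are all disjoint. Then for any design covariance $\Sigma_{SS}$, we have

$$\max_{1 \le k \le K} \psi(\beta^{(k)*}) \;\overset{(a)}{\le}\; \psi(B^*) \;\overset{(b)}{\le}\; \sum_{k=1}^{K} \psi(\beta^{(k)*}). \tag{22}$$

Proof. First note that, since all supports are disjoint, $Z_i^{(k)*} = \mathrm{sign}(\beta_i^{(k)*})$, so that $Z_S^{(k)*T}\Sigma_{SS}^{-1}Z_S^{(k)*} = \psi(\beta^{(k)*})$. Inequality (b) is then immediate, since $|||Z_S^{*T}\Sigma_{SS}^{-1}Z_S^*|||_2 \le \mathrm{tr}(Z_S^{*T}\Sigma_{SS}^{-1}Z_S^*)$. To establish inequality (a), we note that

$$\psi(B^*) = \max_{x \in \mathbb{R}^K :\, \|x\| \le 1} x^T Z_S^{*T}\Sigma_{SS}^{-1}Z_S^*\, x \;\ge\; \max_{1 \le k \le K} e_k^T Z_S^{*T}\Sigma_{SS}^{-1}Z_S^*\, e_k = \max_{1 \le k \le K} Z_S^{(k)*T}\Sigma_{SS}^{-1}Z_S^{(k)*}.$$

□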


We  illustrate  Corollary  2  with  an  example. 

Example 3. Disjoint support with uncorrelated design Suppose that $\Sigma_{SS} = I_{s \times s}$, and the supports are disjoint. In this case, we claim that the sample complexity of the $\ell_1/\ell_2$ group Lasso is the same as the ordinary Lasso. If the individual regressions have disjoint support, then $Z_S^* = \zeta(B_S^*)$ has only a single non-zero entry per row and therefore the columns of $Z^*$ are orthogonal. Moreover, $Z_{ik}^* = \mathrm{sign}(\beta_i^{(k)*})$. By Lemma 1(b), the sparsity-overlap function $\psi(B^*)$ is equal to the largest squared column norm. But $\|Z^{(k)*}\|_2^2 = \sum_{i=1}^{s} \mathrm{sign}(\beta_i^{(k)*})^2 = s_k$. Thus, the sample complexity of the group Lasso is the same as the ordinary Lasso in this case.*

* In making this assertion, we are ignoring any difference between $\log(p - s_k)$ and $\log(p - s)$, which is valid, for instance, in the regime of sublinear sparsity, when $s_k/p \to 0$ and $s/p \to 0$.

Finally, we consider an example that illustrates the effect of correlated designs:

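Example 4. Effects of correlated designs To illustrate the behavior of the sparsity-overlap function in the presence of correlations in the design, we consider the simple case of two regressions with support of size 2. For parameters $\vartheta_1$ and $\vartheta_2 \in [0, \pi]$ and $\rho \in (-1, 1)$, consider regression matrices $B^*$ such that $B^* = \zeta(B_S^*)$ and

$$\zeta(B_S^*) = \begin{bmatrix} \cos(\vartheta_1) & \sin(\vartheta_1) \\ \cos(\vartheta_2) & \sin(\vartheta_2) \end{bmatrix} \quad \text{and} \quad (\Sigma_{SS})^{-1} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}. \tag{23}$$

Setting $M^* = \zeta(B_S^*)^T \Sigma_{SS}^{-1} \zeta(B_S^*)$, a simple calculation shows that

$$\mathrm{tr}(M^*) = 2\big(1 + \rho\cos(\vartheta_1 - \vartheta_2)\big), \quad \text{and} \quad \det(M^*) = (1 - \rho^2)\sin(\vartheta_1 - \vartheta_2)^2,$$

so that the eigenvalues of $M^*$ are

$$\mu^+ = (1 + \rho)\big(1 + \cos(\vartheta_1 - \vartheta_2)\big), \quad \text{and} \quad \mu^- = (1 - \rho)\big(1 - \cos(\vartheta_1 - \vartheta_2)\big),$$

so that $\psi(B^*) = \max(\mu^+, \mu^-)$. On the other hand, with

$$Z^{(1)*} = \zeta(\beta^{(1)*}) = \begin{pmatrix} \mathrm{sign}(\cos(\vartheta_1)) \\ \mathrm{sign}(\cos(\vartheta_2)) \end{pmatrix}, \qquad Z^{(2)*} = \zeta(\beta^{(2)*}) = \begin{pmatrix} \mathrm{sign}(\sin(\vartheta_1)) \\ \mathrm{sign}(\sin(\vartheta_2)) \end{pmatrix},$$

we have

$$\psi(\beta^{(1)*}) = Z^{(1)*T}\Sigma_{SS}^{-1}Z^{(1)*} = \mathbf{1}\{\cos(\vartheta_1) \ne 0\} + \mathbf{1}\{\cos(\vartheta_2) \ne 0\} + 2\rho\,\mathrm{sign}(\cos(\vartheta_1)\cos(\vartheta_2)),$$

$$\psi(\beta^{(2)*}) = Z^{(2)*T}\Sigma_{SS}^{-1}Z^{(2)*} = \mathbf{1}\{\sin(\vartheta_1) \ne 0\} + \mathbf{1}\{\sin(\vartheta_2) \ne 0\} + 2\rho\,\mathrm{sign}(\sin(\vartheta_1)\sin(\vartheta_2)).$$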

Figure 1 provides a graphical comparison of these sample complexity functions. The function $\max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ is discontinuous on $\mathcal{S} = \frac{\pi}{2}\mathbb{Z} \times \mathbb{R} \,\cup\, \mathbb{R} \times \frac{\pi}{2}\mathbb{Z}$, and, as a consequence, so is its difference with $\psi(B^*)$. Note that, for fixed $\vartheta_1$ or fixed $\vartheta_2$, some of these discontinuities are removable discontinuities of the induced function on the other variable, and these discontinuities therefore create needles, slits or flaps in the graph of this function.
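Denote by $\mathcal{R}^+$ (resp. $\mathcal{R}^-$) the set $\mathcal{R}^+ = \{(\vartheta_1, \vartheta_2) \mid \min[\cos(\vartheta_1)\cos(\vartheta_2), \sin(\vartheta_1)\sin(\vartheta_2)] > 0\}$ (resp. $\mathcal{R}^- = \{(\vartheta_1, \vartheta_2) \mid \max[\cos(\vartheta_1)\cos(\vartheta_2), \sin(\vartheta_1)\sin(\vartheta_2)] < 0\}$), on which $\max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ reaches its minimum value when $\rho \le -0.5$ (resp. when $\rho \ge 0.5$) (see bottom and middle center plots in Figure 1). For $\rho = 0$, the top center graph illustrates that $\max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ is equal to 2 except for the cases of matrices $B_S^*$ with disjoint support, corresponding to the discrete set $\mathcal{D} = \{(k\frac{\pi}{2}, (k \pm 1)\frac{\pi}{2}),\ k \in \mathbb{Z}\}$, for which it equals 1. The top rightmost graph illustrates that, as shown in Corollary 1, the inequality always holds for an uncorrelated design. For $\rho > 0$, the inequality $\psi(B^*) \le \max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ is violated only on a subset of $\mathcal{S} \cup \mathcal{R}^-$; and for $\rho < 0$, the inequality is symmetrically violated on a subset of $\mathcal{S} \cup \mathcal{R}^+$ (see Figure 1).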
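The closed-form eigenvalues of Example 4 can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it assumes, as in equation (23), that the inverse covariance on the support has unit diagonal and off-diagonal entry $\rho$.

```python
import numpy as np

def example4_psi(theta1, theta2, rho):
    """Largest eigenvalue of M* = zeta(B)^T Sigma_SS^{-1} zeta(B), computed directly."""
    Z = np.array([[np.cos(theta1), np.sin(theta1)],
                  [np.cos(theta2), np.sin(theta2)]])
    Sigma_inv = np.array([[1.0, rho], [rho, 1.0]])  # assumed form of (Sigma_SS)^{-1}
    M = Z.T @ Sigma_inv @ Z                         # symmetric 2 x 2 matrix
    return np.linalg.eigvalsh(M).max()

def example4_closed_form(theta1, theta2, rho):
    """max(mu+, mu-) from the trace/determinant calculation."""
    c = np.cos(theta1 - theta2)
    return max((1 + rho) * (1 + c), (1 - rho) * (1 - c))

# A few arbitrary test points (theta1, theta2, rho).
points = [(0.3, 1.1, 0.5), (2.0, 0.4, -0.7), (1.0, 1.0, 0.9)]
gaps = [abs(example4_psi(*pt) - example4_closed_form(*pt)) for pt in points]
```

The entries of `gaps` should be at the level of floating-point round-off if the two expressions agree.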

2.4  Illustrative  simulations 

In  this  section,  we  provide  the  results  of  some  simulations  to  illustrate  the  sharpness  of 
Theorem  1,  and  furthermore  to  ascertain  how  quickly  the  predicted  behavior  is  observed 
as  elements  of  the  triple  (n,  p,  s)  grow  in  different  regimes.  We  explore  the  case  of  two 
regression  tasks  (i.e.,  K  =  2)  which  share  an  identical  support  set  S  with  cardinality  \S\  =  s 
in Section 2.4.1 and consider a slightly more general case in Section 2.4.2.

Figure 1. Comparison of sparsity-overlap functions for $\ell_1/\ell_2$ and the Lasso. For the pair $(\vartheta_1, \vartheta_2)$, we represent in each row of plots, corresponding respectively to $\rho = 0$ (top), $0.9$ (middle) and $-0.9$ (bottom), from left to right, the quantities: $\psi(B^*)$ (left), $\max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ (center) and $\max(0, \psi(B^*) - \max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*})))$ (right). The latter indicates when the inequality $\psi(B^*) \le \max(\psi(\beta^{(1)*}), \psi(\beta^{(2)*}))$ does not hold and by how much it is violated.

2.4.1  Phase  transition  behavior 

This first set of experiments is designed to reveal the phase transition behavior predicted by Theorem 1. The design matrix $X$ is sampled from the standard Gaussian ensemble, with i.i.d. entries $X_{ij} \sim N(0, 1)$. We consider two types of sparsity,

•  logarithmic sparsity, where $s = \alpha\log(p)$, for $\alpha = 2/\log(2)$, and

•  linear sparsity, where $s = \alpha p$, for $\alpha = 1/8$,

for various ambient model dimensions $p \in \{16, 32, 64, 256, 512, 1024\}$. For a given triplet $(n, p, s)$, we solve the block-regularized problem (7) with the regularization parameter $\lambda_n = \sqrt{\log(p - s)(\log s)/n}$. For each fixed $(p, s)$ pair, we measure the sample complexity in terms of a parameter $\theta$, in particular letting $n = \theta s\log(p - s)$ for $\theta \in [0.25, 1.5]$.
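The experimental grid described above can be sketched as a few helper functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, the rounding of $s$ and $n$ to integers is my own choice (the text does not specify it), and the sample size follows the scaling $n = \theta s \log(p - s)$ as written in this paragraph.

```python
import math

def sparsity(p, regime):
    """Support size s for the two sparsity regimes in the text."""
    if regime == "logarithmic":
        # s = alpha * log(p) with alpha = 2 / log(2), i.e. s = 2 * log2(p)
        return round(2 * math.log2(p))
    if regime == "linear":
        return p // 8  # s = p / 8
    raise ValueError(f"unknown regime: {regime}")

def sample_size(theta, p, s):
    """n = theta * s * log(p - s), rounded up to an integer (assumed rounding)."""
    return math.ceil(theta * s * math.log(p - s))

def regularization(n, p, s):
    """lambda_n = sqrt(log(p - s) * log(s) / n)."""
    return math.sqrt(math.log(p - s) * math.log(s) / n)

p = 256
s_log = sparsity(p, "logarithmic")
s_lin = sparsity(p, "linear")
n_lin = sample_size(1.0, p, s_lin)
lam_lin = regularization(n_lin, p, s_lin)
```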

We let the matrix $B^* \in \mathbb{R}^{p \times 2}$ of regression coefficients have entries $\beta_{ij}^*$ in $\{-1/\sqrt{2}, 1/\sqrt{2}\}$, choosing the parameters to vary the angle between the two columns, thereby obtaining various desired values of $\psi(B^*)$. Since $\Sigma = I_{p \times p}$ for the standard Gaussian ensemble, the sparsity-overlap function $\psi(B^*)$ is simply the maximal eigenvalue of the Gram matrix $\zeta(B_S^*)^T\zeta(B_S^*)$.

Figure 2. Plots of support union recovery probability $P[\widehat{S} = S]$ versus the control parameter $\theta = n/[2s\log(p - s)]$ for two different types of sparsity, linear sparsity in the left column ($s = p/8$) and logarithmic sparsity in the right column ($s = 2\log_2(p)$), and using $\ell_1/\ell_2$ regularization in the three first rows to estimate the support respectively in the three cases of identical regression, intermediate angles and orthonormal regressions. The fourth row presents results for the Lasso in the case of identical regressions.

Figure 3. Plots of support recovery probability $P[\widehat{S} = S]$ versus the control parameter $\theta = n/[2s\log(p - s)]$ for two different types of sparsity, logarithmic sparsity on top ($s = O(\log(p))$) and linear sparsity on bottom ($s = \alpha p$), and for increasing values of $p$ from left to right (panels include $p = 256$, $s = p/8 = 32$ and $p = 1024$, $s = p/8 = 128$). The noise level is set at $\sigma = 0.1$. Each graph shows four curves (black, red, green, blue) corresponding to the case of independent $\ell_1$ regularization, and, for $\ell_1/\ell_2$ regularization, the cases of identical regression, intermediate angles, and “orthonormal” regressions. Note how curves corresponding to the same case across different problem sizes $p$ all coincide, as predicted by Theorem 1. Moreover, consistent with the theory, the curves for the identical regression group reach $P[\widehat{S} = S] \approx 0.50$ at $\theta \approx 1$, whereas the orthogonal regression group reaches 50% success substantially earlier.


Since $|\beta_{ij}^*| = 1/\sqrt{2}$ by construction, we are guaranteed that $B_S^* = \zeta(B_S^*)$, that the minimum value $b_{\min}^* = 1$, and moreover that the columns of $\zeta(B_S^*)$ have the same Euclidean norm.

To construct parameter matrices $B^*$ that satisfy $|\beta_{ij}^*| = 1/\sqrt{2}$, we choose both $p$ and the sparsity scalings so that the obtained values for $s$ are multiples of four. We then construct the columns $Z^{(1)*}$ and $Z^{(2)*}$ of the matrix $B^* = \zeta(B^*)$ from copies of vectors of length four. Denoting by $\otimes$ the usual matrix tensor product, we consider the following constructions:

Identical regressions: We set $Z^{(1)*} = Z^{(2)*} = \frac{1}{\sqrt{2}}\vec{1}_s$, so that the sparsity-overlap function is $\psi(B^*) = s$.

Orthonormal regressions: Here $B^*$ is constructed with $Z^{(1)*} \perp Z^{(2)*}$, so that $\psi(B^*) = \frac{s}{2}$, the most favorable situation. In order to achieve this orthonormality, we set $Z^{(1)*} = \frac{1}{\sqrt{2}}\vec{1}_s$ and $Z^{(2)*} = \frac{1}{\sqrt{2}}\vec{1}_{s/2} \otimes (1, -1)^T$.

Intermediate angles: In this intermediate case, the columns $Z^{(1)*}$ and $Z^{(2)*}$ are at a 60° angle, which leads to $\psi(B^*) = \frac{3}{4}s$. Specifically, we set $Z^{(1)*} = \frac{1}{\sqrt{2}}\vec{1}_s$ and $Z^{(2)*} = \frac{1}{\sqrt{2}}\vec{1}_{s/4} \otimes (1, 1, 1, -1)^T$.
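The three constructions can be reproduced with Kronecker products. The following is a sketch rather than the authors' code: the function names are hypothetical, and the specific 4-vector $(1, 1, 1, -1)$ used for the intermediate case is an assumption chosen to produce a 60° angle between the columns.

```python
import numpy as np

def build_columns(s, case):
    """Return the s x 2 matrix [Z1, Z2] with entries +-1/sqrt(2)."""
    if s % 4 != 0:
        raise ValueError("s must be a multiple of four")
    z1 = np.ones(s) / np.sqrt(2)
    if case == "identical":
        z2 = z1.copy()
    elif case == "intermediate":
        # assumed sign pattern giving cos(angle) = 1/2, i.e. a 60 degree angle
        z2 = np.kron(np.ones(s // 4), [1, 1, 1, -1]) / np.sqrt(2)
    elif case == "orthonormal":
        z2 = np.kron(np.ones(s // 2), [1, -1]) / np.sqrt(2)
    else:
        raise ValueError(f"unknown case: {case}")
    return np.column_stack([z1, z2])

def psi_identity_design(Z):
    """Sparsity-overlap for Sigma = I: largest eigenvalue of the Gram matrix Z^T Z."""
    return np.linalg.eigvalsh(Z.T @ Z).max()

s = 16
psi_values = {case: psi_identity_design(build_columns(s, case))
              for case in ("identical", "intermediate", "orthonormal")}
```

With these choices `psi_values` should contain $s$, $3s/4$ and $s/2$ respectively, matching the values quoted in the text.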

Figure 2 shows plots of linear sparsity (left column) and logarithmic sparsity (right column) for all three cases solved using the group $\ell_1/\ell_2$ relaxation (top three rows), as well as the reference Lasso case for the case of identical regressions (bottom row). Each panel plots the success probability $P[\widehat{S} = S]$ versus the rescaled sample size $\theta = n/[2s\log(p - s)]$. Under this re-scaling, Theorem 1 predicts that the curves should align, and that the success probability should transition to 1 once $\theta$ exceeds a critical threshold (dependent on the type of ensemble). Note that for suitably large problem sizes ($p \ge 128$), the curves do align in the predicted way, showing step-function behavior. Figure 3 plots data from the same simulations in a different format. Here the top row corresponds to logarithmic sparsity, and the bottom row to linear sparsity; each panel shows the four different choices for $B^*$, with the problem size $p$ increasing from left to right. Note how in each panel the location of the transition of $P[\widehat{S} = S]$ to one shifts from right to left, as we move from the case of identical regressions to intermediate angles to orthogonal regressions.


2.4.2  Empirical  thresholds 

In this experiment, we aim at verifying more precisely the location of the $\ell_1/\ell_2$ threshold as the regression vectors vary continuously from identical to orthonormal. We consider the case of matrices $B^*$ of size $s \times 2$ for $s$ even. In Example 4 of Section 2.3, we characterized the value of $\psi(B^*)$ if $B^*$ is a $2 \times 2$ matrix.

In order to generate a family of regression matrices with smoothly varying sparsity-overlap function, consider the following $2 \times 2$ matrix:

$$B_1(\alpha) = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \cos(\frac{\pi}{4} + \alpha) & \sin(\frac{\pi}{4} + \alpha) \end{bmatrix}. \tag{24}$$

Note that $\alpha$ is the angle between the two rows of $B_1(\alpha)$ in this setup. Note moreover that the columns of $B_1(\alpha)$ have varying norm.

We use this base matrix to define the following family of regression matrices $B_S^* \in \mathbb{R}^{s \times 2}$:

$$\mathcal{B}_1 := \Big\{ B_{1s}(\alpha) = \vec{1}_{s/2} \otimes B_1(\alpha),\ \alpha \in \big[0, \tfrac{\pi}{2}\big] \Big\}. \tag{25}$$

For a design matrix drawn from the standard Gaussian ensemble, the analysis of Example 4 in Section 2.3 naturally extends to show that the sparsity-overlap function is $\psi(B_{1s}(\alpha)) = \frac{s}{2}(1 + |\cos(\alpha)|)$. Moreover, as we vary $\alpha$ from 0 to $\frac{\pi}{2}$, the two regressions vary from identical to “orthonormal” and the sparsity-overlap function decreases from $s$ to $\frac{s}{2}$.
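The claimed value of the sparsity-overlap function on this family can be verified numerically. The sketch below assumes the base matrix has first row $(1/\sqrt{2}, 1/\sqrt{2})$ and second row at angle $\pi/4 + \alpha$, and an identity design covariance; the function names are hypothetical.

```python
import numpy as np

def B1(alpha):
    """2 x 2 base matrix whose two unit-norm rows are at angle alpha."""
    return np.array([[1 / np.sqrt(2), 1 / np.sqrt(2)],
                     [np.cos(np.pi / 4 + alpha), np.sin(np.pi / 4 + alpha)]])

def B1s(alpha, s):
    """Stack s/2 copies of B1(alpha) vertically: 1_{s/2} (Kronecker) B1(alpha)."""
    return np.kron(np.ones((s // 2, 1)), B1(alpha))

def psi(B):
    """Sparsity-overlap for Sigma = I: top eigenvalue of zeta(B)^T zeta(B)."""
    Z = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.linalg.eigvalsh(Z.T @ Z).max()

s = 22
alphas = [0.0, np.pi / 6, np.pi / 4, np.pi / 2]
computed = [psi(B1s(a, s)) for a in alphas]
predicted = [s / 2 * (1 + abs(np.cos(a))) for a in alphas]
```

The lists `computed` and `predicted` should agree up to round-off, decreasing from $s$ at $\alpha = 0$ to $s/2$ at $\alpha = \pi/2$.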

We fix the problem size $p = 2048$ and sparsity $s = 2\log_2(p) = 22$. For each value of $\alpha \in [0, \frac{\pi}{2}]$, we generate a matrix from the specified family and angle. We then solve the block-regularized problem (7) with sample size $n = 2\theta s\log(p - s)$ for a range of $\theta$ in $[0.25, 1.5]$; for each value of $\theta$, we repeat the experiment (generating random design matrix $X$ and observation matrix $W$ each time) over $T = 500$ trials. Based on these trials, we then estimate the value of $\theta_{50\%}$ for which the exact support is retrieved at least 50% of the time. Since $\psi(B^*) = s\,\frac{1 + |\cos(\alpha)|}{2}$, theory predicts that if we plot $\theta_{50\%}$ versus $|\cos(\alpha)|$, it should lie on or below the straight line $\frac{1 + |\cos(\alpha)|}{2}$. We also perform the same experiments for row selection using the ordinary Lasso, and plot the resulting estimated thresholds on the same axes.

The results are shown in Figure 4. Note first that the curve obtained for $\ell_1/\ell_2$ (blue circles) coincides roughly with the theoretical prediction $\frac{1 + |\cos(\alpha)|}{2}$ (dashed diagonal) as regressions vary from orthogonal to identical. Moreover, the estimated $\theta_{50\%}$ of the ordinary Lasso remains above 0.9 for all values of $\alpha$, which is close to the theoretical value of 1. However, the curve obtained is not constant, but is roughly sigmoidal with a first plateau close to 1 for $\cos(\alpha) < 0.4$ and a second plateau close to 0.9 for $\cos(\alpha) > 0.5$. The latter coincides with the empirical value of $\theta_{50\%}$ for the univariate Lasso for the first column $\beta^{(1)*}$ (not shown). There are two reasons why the value of $\theta_{50\%}$ for the ordinary Lasso does not match the prediction of the first-order asymptotics: first, for $\alpha = \frac{\pi}{4}$ (corresponding to $\cos(\alpha) \approx 0.7$), the support of $\beta^{(1)*}$ is reduced by one half and therefore its sample complexity is decreased in that region. Second, the supports recovered by individual Lassos for $\beta^{(1)*}$ and $\beta^{(2)*}$ vary from uncorrelated when $\alpha = \frac{\pi}{2}$ to identical when $\alpha = 0$. It is therefore not surprising that the sample complexity is the same as a single univariate Lasso for $\cos(\alpha)$ large and higher for $\cos(\alpha)$ small, where independent estimates of the support are more likely to include, by union, spurious covariates in the row support.


3  Proof  of  Theorem  1 


In this section, we provide the proof of our main result. For the convenience of the reader, we begin by recapitulating the notation to be used throughout the argument.

•  the sets $S$ and $S^c$ are a partition of the set of columns of $X$, such that $|S| = s$ and $|S^c| = p - s$;

•  the design matrix is partitioned as $X = [X_S \; X_{S^c}]$, where $X_S \in \mathbb{R}^{n \times s}$ and $X_{S^c} \in \mathbb{R}^{n \times (p - s)}$;

•  the regression coefficient matrix is also partitioned as $B^* = \begin{bmatrix} B_S^* \\ B_{S^c}^* \end{bmatrix}$, with $B_S^* \in \mathbb{R}^{s \times K}$ and $B_{S^c}^* = 0 \in \mathbb{R}^{(p - s) \times K}$. We use $\beta_i^*$ to denote the $i$th row of $B^*$;

•  the regression model is given by $Y = XB^* + W$, where the noise matrix $W \in \mathbb{R}^{n \times K}$ has i.i.d. $N(0, \sigma^2)$ entries;

•  the matrix $Z_S^* = \zeta(B_S^*) \in \mathbb{R}^{s \times K}$ has rows $Z_i^* = \zeta(\beta_i^*) = \frac{\beta_i^*}{\|\beta_i^*\|_2} \in \mathbb{R}^K$.

Figure 4. Plots of the Lasso sample complexity $\theta = n/[2s\log(p - s)]$ for which the probability of union support recovery exceeds 50% empirically as a function of $|\cos(\alpha)|$ for $\ell_1$-based recovery and $\ell_1/\ell_2$-based recovery, where $\alpha$ is the angle parameter of the family $\mathcal{B}_1$ (the angle between the two rows of $B_1(\alpha)$). We consider the two following methods for performing row selection: Ordinary Lasso ($\ell_1$, green triangles) and group $\ell_1/\ell_2$ Lasso (blue circles).


3.1  High-level  proof  outline 

At a high level, the proof is based on the notion of a primal-dual witness: we construct a primal matrix $\widehat{B}$ along with a dual matrix $\widehat{Z}$ such that:

(a) the pair $(\widehat{B}, \widehat{Z})$ together satisfy the Karush-Kuhn-Tucker (KKT) conditions associated with the second-order cone program (7), and

(b) this solution certifies that the SOCP recovers the union of supports $S$.

For general high-dimensional problems (with $p \gg n$), the SOCP (7) need not have a unique solution; however, a consequence of our theory is that the constructed solution $\widehat{B}$ is the unique optimal solution under the conditions of Theorem 1.

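We begin by noting that the block-regularized problem (7) is convex, and not differentiable for all $B$. In particular, denoting by $\beta_i$ the $i$th row of $B$, the subdifferential of the $\ell_1/\ell_2$-block norm over row $i$ takes the form

$$\big[\partial\|B\|_{\ell_1/\ell_2}\big]_i = \begin{cases} \dfrac{\beta_i}{\|\beta_i\|_2} & \text{if } \beta_i \ne 0, \\[2mm] Z_i \text{ such that } \|Z_i\|_2 \le 1 & \text{otherwise.} \end{cases} \tag{26}$$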
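The subdifferential in (26) can be illustrated with a small helper that returns one valid subgradient of the block norm. This is an illustrative sketch: the function names are hypothetical, and for zero rows it picks the zero vector, which is just one admissible choice among all vectors of norm at most one.

```python
import numpy as np

def block_norm(B):
    """l1/l2 block norm: sum of the Euclidean norms of the rows of B."""
    return np.linalg.norm(B, axis=1).sum()

def block_subgradient(B, tol=1e-12):
    """One element of the subdifferential of the l1/l2 block norm at B."""
    B = np.asarray(B, dtype=float)
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    Z = np.zeros_like(B)
    active = norms[:, 0] > tol
    Z[active] = B[active] / norms[active]  # beta_i / ||beta_i||_2 on non-zero rows
    return Z                               # zero rows: Z_i = 0 satisfies ||Z_i||_2 <= 1

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
B[2] = 0.0                                 # include an inactive row
Z = block_subgradient(B)
# Subgradient inequality ||B'|| >= ||B|| + <Z, B' - B>, checked at random points B'.
slacks = [block_norm(Bp) - block_norm(B) - np.sum(Z * (Bp - B))
          for Bp in rng.standard_normal((20, 5, 3))]
```

Every entry of `slacks` should be non-negative, which is the defining property of a subgradient of a convex function.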


We also use the shorthand $\zeta(\beta_i) = \beta_i/\|\beta_i\|_2$ with an analogous definition for the matrix $\zeta(B_S)$, assuming that no row of $B_S$ is identically zero. In addition, we define the empirical covariance matrix

$$\widehat{\Sigma} := \frac{1}{n}X^TX = \frac{1}{n}\sum_{i=1}^{n} X_iX_i^T, \tag{27}$$

where $X_i$ is the $i$th row of $X$, written as a column vector. We also make use of the shorthand $\widehat{\Sigma}_{SS} = \frac{1}{n}X_S^TX_S$ and $\widehat{\Sigma}_{S^cS} = \frac{1}{n}X_{S^c}^TX_S$, as well as $\Pi_S = \frac{1}{n}X_S(\widehat{\Sigma}_{SS})^{-1}X_S^T$ to denote the projector on the range of $X_S$.

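At the core of our constructive procedure is the following convex-analytic result, which characterizes an optimal primal-dual pair for which the primal solution $\widehat{B}$ correctly recovers the support set $S$:

Lemma 2. Suppose that there exists a primal-dual pair $(\widehat{B}, \widehat{Z})$ that satisfies the conditions:

$$\widehat{Z}_S = \zeta(\widehat{B}_S) \tag{28a}$$

$$\widehat{\Sigma}_{SS}(\widehat{B}_S - B_S^*) - \frac{1}{n}X_S^TW = -\lambda_n\widehat{Z}_S \tag{28b}$$

$$\lambda_n\big\|\widehat{Z}_{S^c}\big\|_{\ell_\infty/\ell_2} := \Big\|\widehat{\Sigma}_{S^cS}(\widehat{B}_S - B_S^*) - \frac{1}{n}X_{S^c}^TW\Big\|_{\ell_\infty/\ell_2} < \lambda_n \tag{28c}$$

$$\widehat{B}_{S^c} = 0. \tag{28d}$$

Then $(\widehat{B}, \widehat{Z})$ is a primal-dual optimal solution to the block-regularized problem, with $S(\widehat{B}) = S$ by construction. If $\widehat{\Sigma}_{SS} \succ 0$, then $\widehat{B}$ is the unique optimal primal solution.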


See Appendix A for the proof of this claim. Based on Lemma 2, we proceed to construct the required primal-dual pair $(\widehat{B}, \widehat{Z})$ as follows. First, we set $\widehat{B}_{S^c} = 0$, so that condition (28d) is satisfied. Next, we specify the pair $(\widehat{B}_S, \widehat{Z}_S)$ by solving the following restricted version of the SOCP:

$$\widehat{B}_S = \arg\min_{B_S \in \mathbb{R}^{s \times K}} \bigg\{ \frac{1}{2n}\Big|\Big|\Big| Y - X\begin{bmatrix} B_S \\ 0_{S^c} \end{bmatrix} \Big|\Big|\Big|_F^2 + \lambda_n\|B_S\|_{\ell_1/\ell_2} \bigg\}. \tag{29}$$

Since $s < n$, the empirical covariance (sub)matrix $\widehat{\Sigma}_{SS} = \frac{1}{n}X_S^TX_S$ is strictly positive definite with probability one, which implies that the restricted problem (29) is strictly convex and therefore has a unique optimum $\widehat{B}_S$. We then choose $\widehat{Z}_S$ to be the solution of equation (28b). Since any such matrix $\widehat{Z}_S$ is also a dual solution to the SOCP (29), it must be an element of the subdifferential $\partial\|\widehat{B}_S\|_{\ell_1/\ell_2}$.
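To make the construction concrete, the sketch below solves a small instance of the restricted problem (29) and recovers the dual matrix from equation (28b). Note the assumptions: it uses proximal gradient descent with row-wise soft-thresholding instead of an SOCP solver (a swapped-in technique, not the paper's), and the function names, problem sizes, noise level and regularization value are all hypothetical.

```python
import numpy as np

def row_soft_threshold(B, tau):
    """Proximal operator of tau * (l1/l2 block norm): shrink each row's norm by tau."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-300))
    return scale * B

def group_lasso_pg(X, Y, lam, n_iter=2000):
    """Minimise (1/2n)||Y - XB||_F^2 + lam * ||B||_{l1/l2} by proximal gradient."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n
    step = 1.0 / np.linalg.norm(Sigma_hat, 2)   # 1 / Lipschitz constant of the gradient
    B = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = Sigma_hat @ B - X.T @ Y / n
        B = row_soft_threshold(B - step * grad, step * lam)
    return B

def dual_from_28b(X_S, Y, B_S, lam):
    """Solve (28b) for Z_S, using Y = X_S B*_S + W to eliminate B*_S and W."""
    n = X_S.shape[0]
    return -(X_S.T @ (X_S @ B_S - Y)) / (n * lam)

rng = np.random.default_rng(0)
n, s, K, lam = 200, 4, 2, 0.1                   # hypothetical toy sizes
X_S = rng.standard_normal((n, s))
B_true = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])
Y = X_S @ B_true + 0.1 * rng.standard_normal((n, K))

B_hat = group_lasso_pg(X_S, Y, lam)
Z_hat = dual_from_28b(X_S, Y, B_hat, lam)
row_norms = np.linalg.norm(B_hat, axis=1, keepdims=True)
max_gap = np.abs(Z_hat - B_hat / row_norms).max()  # deviation from condition (28a)
```

When no row of the solution is zero, `max_gap` should be close to zero, which is condition (28a): the dual matrix obtained from (28b) coincides with the row-normalized primal solution.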


It remains to show that this construction satisfies conditions (28a) and (28c). In order to satisfy condition (28a), it suffices to show that no row of the solution $\widehat{B}_S$ is identically zero. From equation (28b) and using the invertibility of the empirical covariance matrix $\widehat{\Sigma}_{SS}$, we may solve as follows

$$(\widehat{B}_S - B_S^*) = \big(\widehat{\Sigma}_{SS}\big)^{-1}\Big[\frac{X_S^TW}{n} - \lambda_n\widehat{Z}_S\Big] =: U_S. \tag{30}$$

Note that for any row $i \in S$, by the triangle inequality, we have

$$\|\widehat{\beta}_i\|_2 \ge \|\beta_i^*\|_2 - \|U_S\|_{\ell_\infty/\ell_2}.$$

Therefore, in order to show that no row of $\widehat{B}_S$ is identically zero, it suffices to show that the event

$$\mathcal{E}(U_S) := \Big\{ \|U_S\|_{\ell_\infty/\ell_2} \le \frac{1}{2}b_{\min}^* \Big\} \tag{31}$$

occurs with high probability. (Recall from equation (13) that the parameter $b_{\min}^*$ measures the minimum $\ell_2$-norm of any row of $B_S^*$.) We establish this result in Section 3.2.

Turning to condition (28c), by substituting expression (30) for the difference $(\widehat{B}_S - B_S^*)$ into equation (28c), we obtain a $(p - s) \times K$ random matrix $V_{S^c}$, with rows indexed by $S^c$. For any index $j \in S^c$, the corresponding row vector $V_j \in \mathbb{R}^K$ is given by

$$V_j := X_j^T\Big( [\Pi_S - I_n]\frac{W}{n} - \lambda_n\frac{X_S}{n}(\widehat{\Sigma}_{SS})^{-1}\widehat{Z}_S \Big). \tag{32}$$

In order for condition (28c) to hold, it is necessary and sufficient that the probability of the event

$$\mathcal{E}(V_{S^c}) := \Big\{ \|V_{S^c}\|_{\ell_\infty/\ell_2} < \lambda_n \Big\} \tag{33}$$

converges to one as $n$ tends to infinity. Consequently, the remainder (and bulk) of the proof is devoted to showing that the probabilities $P[\mathcal{E}(U_S)]$ and $P[\mathcal{E}(V_{S^c})]$ both converge to one under the specified conditions.


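3.2  Analysis of $\mathcal{E}(U_S)$: Correct inclusion of supporting covariates

This section is devoted to the analysis of the event $\mathcal{E}(U_S)$ from equation (31), and in particular showing that its probability converges to one under the specified scaling. We begin by defining

$$\widetilde{W} := \frac{1}{\sqrt{n}}(\widehat{\Sigma}_{SS})^{-1/2}X_S^TW.$$

With this notation, we have

$$U_S = (\widehat{\Sigma}_{SS})^{-1/2}\frac{\widetilde{W}}{\sqrt{n}} - \lambda_n(\widehat{\Sigma}_{SS})^{-1}\widehat{Z}_S.$$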

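Using this representation and the triangle inequality, we have

$$\|U_S\|_{\ell_\infty/\ell_2} \le \Big\|(\widehat{\Sigma}_{SS})^{-1/2}\frac{\widetilde{W}}{\sqrt{n}}\Big\|_{\ell_\infty/\ell_2} + \lambda_n\big\|(\widehat{\Sigma}_{SS})^{-1}\widehat{Z}_S\big\|_{\ell_\infty/\ell_2} \le \underbrace{\Big\|(\widehat{\Sigma}_{SS})^{-1/2}\frac{\widetilde{W}}{\sqrt{n}}\Big\|_{\ell_\infty/\ell_2}}_{T_1} + \underbrace{\lambda_n|||(\widehat{\Sigma}_{SS})^{-1}|||_\infty}_{T_2},$$

where the form of $T_2$ in the second line uses a standard matrix norm bound (see equation (42a) in Appendix B), and the fact that $\|\widehat{Z}_S\|_{\ell_\infty/\ell_2} \le 1$.

Using the triangle inequality, we bound $T_2$ as follows:

$$\begin{aligned} T_2 &\le \lambda_n\Big\{ |||(\Sigma_{SS})^{-1}|||_\infty + |||(\widehat{\Sigma}_{SS})^{-1} - (\Sigma_{SS})^{-1}|||_\infty \Big\} \\ &\le \lambda_n\Big\{ D_{\max} + \sqrt{s}\,|||(\widehat{\Sigma}_{SS})^{-1} - (\Sigma_{SS})^{-1}|||_2 \Big\} \\ &\le \lambda_n\Big\{ D_{\max} + \sqrt{s}\,|||(\Sigma_{SS})^{-1}|||_2\,|||(\widetilde{X}_S^T\widetilde{X}_S/n)^{-1} - I_s|||_2 \Big\} \\ &\le \lambda_n\Big\{ D_{\max} + \frac{\sqrt{s}}{C_{\min}}\,|||(\widetilde{X}_S^T\widetilde{X}_S/n)^{-1} - I_s|||_2 \Big\}, \end{aligned}$$

which defines $\widetilde{X}_S$ as a random matrix with i.i.d. standard Gaussian entries. From concentration results in random matrix theory (see Appendix C), for $s/n \to 0$, we have

$$|||(\widetilde{X}_S^T\widetilde{X}_S/n)^{-1} - I_s|||_2 \le O\Big(\sqrt{\frac{s}{n}}\Big)$$

with probability $1 - \exp(-\Theta(n))$. Overall, we conclude that

$$T_2 \le \lambda_n\Big\{ D_{\max} + O\Big(\sqrt{\frac{s^2}{n}}\Big) \Big\}$$

with probability $1 - \exp(-\Theta(n))$.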

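Turning now to $T_1$, note that conditioned on $X_S$, we have $(\mathrm{vec}(\widetilde{W}) \mid X_S) \sim N(\vec{0}_{s \times K}, I_s \otimes I_K)$, where $\mathrm{vec}(A)$ denotes the vectorization of matrix $A$. Using this fact and the definition of the block $\ell_\infty/\ell_2$ norm,

$$T_1 = \max_{i \in S}\Big\| e_i^T(\widehat{\Sigma}_{SS})^{-1/2}\frac{\widetilde{W}}{\sqrt{n}} \Big\|_2 \le |||(\widehat{\Sigma}_{SS})^{-1}|||_2^{1/2}\Big[\frac{1}{n}\max_{i \in S}\zeta_i^2\Big]^{1/2},$$

which defines $\zeta_i^2$ as independent $\chi^2$-variates with $K$ degrees of freedom. Using the tail bound in Lemma 8 (see Appendix F) with $t = 2K\log s > K$, we have

$$P\Big[\frac{1}{n}\max_{i \in S}\zeta_i^2 \ge \frac{4K\log s}{n}\Big] \le \exp\Big(-2K\log s\Big(1 - 2\sqrt{\frac{1}{2\log s}}\Big) + \log s\Big) \to 0.$$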

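Defining the event $\mathcal{T} := \big\{ |||(\widehat{\Sigma}_{SS})^{-1}|||_2 \le \frac{2}{C_{\min}} \big\}$, we have $P[\mathcal{T}] \ge 1 - 2\exp(-\Theta(n))$, again using concentration results from random matrix theory (see Appendix C). Therefore,

$$\begin{aligned} P\Big[T_1 \ge \sqrt{\frac{8K\log s}{C_{\min}n}}\Big] &\le P\Big[T_1 \ge \sqrt{\frac{8K\log s}{C_{\min}n}} \,\Big|\, \mathcal{T}\Big] + P[\mathcal{T}^c] \\ &\le P\Big[\frac{1}{n}\max_{i \in S}\zeta_i^2 \ge \frac{4K\log s}{n}\Big] + 2\exp(-\Theta(n)) \\ &= O\big(\exp(-\Theta(\log s))\big) \to 0. \end{aligned}$$

Finally, combining the pieces, we conclude that with probability $1 - \exp(-\Theta(\log s))$, we have

$$\frac{\|U_S\|_{\ell_\infty/\ell_2}}{b_{\min}^*} \le \frac{T_1 + T_2}{b_{\min}^*} \le \frac{1}{b_{\min}^*}\Big[\sqrt{\frac{8K\log s}{C_{\min}n}} + \lambda_n\Big(D_{\max} + O\Big(\sqrt{\frac{s^2}{n}}\Big)\Big)\Big].$$

With the assumed scaling $n = \Omega(s\log(p - s))$, we have

$$\frac{\|U_S\|_{\ell_\infty/\ell_2}}{b_{\min}^*} \le \frac{1}{b_{\min}^*}\Big[O\Big(\frac{1}{\sqrt{s}}\Big) + \lambda_n\Big(D_{\max} + O\Big(\sqrt{\frac{s}{\log(p - s)}}\Big)\Big)\Big] \tag{34}$$

with probability greater than $1 - 2\exp(-c\log(s)) \to 1$, so that the conditions of Theorem 1 are sufficient to ensure that event $\mathcal{E}(U_S)$ holds with high probability as claimed.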


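3.3  Analysis of $\mathcal{E}(V_{S^c})$: Correct exclusion of non-support

For simplicity, in the following arguments, we drop the index $S^c$ and write $V$ for $V_{S^c}$. In order to show that $\|V\|_{\ell_\infty/\ell_2} < \lambda_n$ with probability converging to one, we make use of the decomposition $\frac{1}{\lambda_n}\|V\|_{\ell_\infty/\ell_2} \le \sum_{i=1}^{3}T_i'$, where

$$T_1' := \frac{1}{\lambda_n}\big\|E[V \mid X_S]\big\|_{\ell_\infty/\ell_2} \tag{35a}$$

$$T_2' := \frac{1}{\lambda_n}\big\|E[V \mid X_S, W] - E[V \mid X_S]\big\|_{\ell_\infty/\ell_2} \tag{35b}$$

$$T_3' := \frac{1}{\lambda_n}\big\|V - E[V \mid X_S, W]\big\|_{\ell_\infty/\ell_2}. \tag{35c}$$

We deal with each of these three terms in turn, showing that with high probability under the specified scaling of $(n, p, s)$, we have $T_1' \le (1 - \gamma)$, and $T_2' = o_p(1)$, and $T_3' < \gamma$, which suffices to show that $\frac{1}{\lambda_n}\|V\|_{\ell_\infty/\ell_2} < 1$ with high probability.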

The  following  lemma  is  useful  in  the  analysis: 

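Lemma 3. Define the matrix $\Delta \in \mathbb{R}^{s \times K}$ with rows $\Delta_i := U_i/\|\beta_i^*\|_2$. As long as $\|\Delta_i\|_2 \le 1/2$ for all row indices $i \in S$, we have

$$\big\|\widehat{Z}_S - \zeta(B_S^*)\big\|_{\ell_\infty/\ell_2} \le 4\|\Delta\|_{\ell_\infty/\ell_2}.$$

See Appendix D for the proof of this claim.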


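3.3.1  Analysis of $T_1'$

Note that by definition of the regression model (4), we have the conditional independence relations

$$W \perp\!\!\!\perp X_{S^c} \mid X_S, \qquad \widehat{Z}_S \perp\!\!\!\perp X_{S^c} \mid X_S, \qquad \text{and} \qquad \widehat{Z}_S \perp\!\!\!\perp X_{S^c} \mid \{X_S, W\}.$$

Using the two first conditional independencies, we have

$$E[V \mid X_S] = E[X_{S^c}^T \mid X_S]\Big( [\Pi_S - I_n]\frac{E[W \mid X_S]}{n} - \lambda_n\frac{X_S}{n}(\widehat{\Sigma}_{SS})^{-1}E[\widehat{Z}_S \mid X_S] \Big).$$

Since $E[W \mid X_S] = 0$, the first term vanishes, and using $E[X_{S^c}^T \mid X_S] = \Sigma_{S^cS}(\Sigma_{SS})^{-1}X_S^T$, we obtain

$$E[V \mid X_S] = -\lambda_n\Sigma_{S^cS}(\Sigma_{SS})^{-1}E[\widehat{Z}_S \mid X_S]. \tag{36}$$

Using the matrix-norm inequality (42a) of Appendix B and then Jensen’s inequality yields

$$\begin{aligned} T_1' &= \big\|\Sigma_{S^cS}(\Sigma_{SS})^{-1}E[\widehat{Z}_S \mid X_S]\big\|_{\ell_\infty/\ell_2} \\ &\le |||\Sigma_{S^cS}(\Sigma_{SS})^{-1}|||_\infty\,E\big[\|\widehat{Z}_S\|_{\ell_\infty/\ell_2} \mid X_S\big] \\ &\le (1 - \gamma). \end{aligned}$$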

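3.3.2  Analysis of $T_2'$

Appealing to the conditional independence relationship $\widehat{Z}_S \perp\!\!\!\perp X_{S^c} \mid \{X_S, W\}$, we have

$$E[V \mid X_S, W] = E[X_{S^c}^T \mid X_S, W]\Big( [\Pi_S - I_n]\frac{W}{n} - \lambda_n\frac{X_S}{n}(\widehat{\Sigma}_{SS})^{-1}E[\widehat{Z}_S \mid X_S, W] \Big).$$

Observe that $E[\widehat{Z}_S \mid X_S, W] = \widehat{Z}_S$ because $(X_S, W)$ uniquely specifies $\widehat{B}_S$ through the convex program (29), and the triple $(X_S, W, \widehat{B}_S)$ defines $\widehat{Z}_S$ through equation (28b). Moreover, the noise term disappears because the kernel of the orthogonal projection matrix $(I_n - \Pi_S)$ is the same as the range space of $X_S$, and

$$E[X_{S^c}^T \mid X_S, W][\Pi_S - I_n] = E[X_{S^c}^T \mid X_S][\Pi_S - I_n] = \Sigma_{S^cS}\Sigma_{SS}^{-1}X_S^T[\Pi_S - I_n] = 0.$$

We have thus shown that $E[V \mid X_S, W] = -\lambda_n\Sigma_{S^cS}\Sigma_{SS}^{-1}\widehat{Z}_S$, so that we can conclude that

$$\begin{aligned} T_2' &\le |||\Sigma_{S^cS}(\Sigma_{SS})^{-1}|||_\infty\,\big\|\widehat{Z}_S - E[\widehat{Z}_S \mid X_S]\big\|_{\ell_\infty/\ell_2} \\ &\le (1 - \gamma)\Big\{ E\big[\|\widehat{Z}_S - Z_S^*\|_{\ell_\infty/\ell_2} \mid X_S\big] + \|\widehat{Z}_S - Z_S^*\|_{\ell_\infty/\ell_2} \Big\} \\ &\le (1 - \gamma)\,4\Big\{ E\big[\|\Delta\|_{\ell_\infty/\ell_2} \mid X_S\big] + \|\Delta\|_{\ell_\infty/\ell_2} \Big\}, \end{aligned}$$

where the final inequality uses Lemma 3. Under the assumptions of Theorem 1, this final term is of order $o_p(1)$, as shown in Section 3.2.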

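3.3.3  Analysis of $T_3'$

This third term requires a little more care. We begin by noting that conditionally on $X_S$ and $W$, each vector $V_j \in \mathbb{R}^K$ is normally distributed. Since $\mathrm{Cov}(X^{(j)} \mid X_S, W) = (\Sigma_{S^c|S})_{jj}I_n$, we have

$$\mathrm{Cov}(V_j \mid X_S, W) = M_n(\Sigma_{S^c|S})_{jj},$$

where the $K \times K$ random matrix $M_n = M_n(X_S, W)$ is given by

$$M_n := \frac{\lambda_n^2}{n}\widehat{Z}_S^T(\widehat{\Sigma}_{SS})^{-1}\widehat{Z}_S + \frac{1}{n^2}W^T(I_n - \Pi_S)W. \tag{37}$$

Conditionally on $W$ and $X_S$, the matrix $M_n$ is fixed, and we have

$$\Big(\big\|V_j - E[V_j \mid X_S, W]\big\|_2^2 \;\Big|\; W, X_S\Big) \overset{d}{=} (\Sigma_{S^c|S})_{jj}\,\xi_j^TM_n\xi_j,$$

where $\xi_j \sim N(\vec{0}_K, I_K)$. Since $(\Sigma_{S^c|S})_{jj} \le (\Sigma_{S^cS^c})_{jj} \le C_{\max}$ for all $j$, we have

$$\max_{j \in S^c}(\Sigma_{S^c|S})_{jj}\,\xi_j^TM_n\xi_j \le C_{\max}\,|||M_n|||_2\,\max_{j \in S^c}\|\xi_j\|_2^2,$$

where $|||M_n|||_2$ is the spectral norm.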

We now state a result that provides control on this spectral norm. Intuitively, this result is based on the fact that the matrix $\frac{n}{\lambda_n^2}M_n$ is a random matrix that concentrates in spectral norm around the matrix $M^* = Z_S^{*T}(\Sigma_{SS})^{-1}Z_S^*$, where $Z_S^* = \zeta(B_S^*)$, and the fact that the spectral norm of $M^*$ is directly proportional to the defined sparsity-overlap function $\psi(B^*) := |||\zeta(B_S^*)^T(\Sigma_{SS})^{-1}\zeta(B_S^*)|||_2$.

Lemma 4. For any $\delta > 0$, define the event

$$\mathcal{T}(\delta) := \Big\{ |||M_n|||_2 \le \lambda_n^2\,\frac{\psi(B^*)}{n}(1 + \delta) \Big\}. \tag{38}$$

Under the conditions of Theorem 1, for any $\delta > 0$, there is some $c_1 > 0$ such that $P[\mathcal{T}(\delta)^c] \le 2\exp(-c_1\log s) \to 0$.

See Appendix E for the proof of this lemma.

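Using Lemma 4, we can now complete the proof. For any fixed $\delta > 0$ (which can be made arbitrarily small), we have

$$P[T_3' \ge \gamma] \le P[T_3' \ge \gamma \mid \mathcal{T}(\delta)] + P[\mathcal{T}(\delta)^c].$$

Since $P[\mathcal{T}(\delta)^c] \to 0$ from Lemma 4, it suffices to deal with the first term. Conditioning on the event $\mathcal{T}(\delta)$, we have

$$P[T_3' \ge \gamma \mid \mathcal{T}(\delta)] \le P\Big[\max_{j \in S^c}\|\xi_j\|_2^2 \ge \frac{\gamma^2}{C_{\max}}\,\frac{n}{\psi(B^*)(1 + \delta)}\Big].$$

Define the quantity $t^*(n, B^*) := \frac{1}{2}\frac{\gamma^2}{C_{\max}}\frac{n}{\psi(B^*)(1 + \delta)}$, and note that $t^* \to +\infty$ under the specified scaling of $(n, p, s)$. By applying Lemma 8 from Appendix F on large deviations for $\chi^2$-variates with $t = t^*(n, B^*)$, we obtain

$$P[T_3' \ge \gamma \mid \mathcal{T}(\delta)] \le (p - s)\exp\Big(-t^*\Big[1 - 2\sqrt{\frac{K}{t^*}}\Big]\Big) \le (p - s)\exp\big(-t^*(1 - \delta)\big), \tag{39}$$

for $(n, p, s)$ sufficiently large. Thus, the bound (39) tends to zero at rate $O(\exp(-c\log(p - s)))$ as long as there exists $\nu > 0$ such that we have $(1 - \delta)\,t^*(n, B^*) > (1 + \nu)\log(p - s)$, or equivalently

$$n > (1 + \nu)\,\frac{(1 + \delta)}{(1 - \delta)}\,\frac{C_{\max}}{\gamma^2}\,\big[2\psi(B^*)\log(p - s)\big],$$

as claimed.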
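The final sample-size condition follows from the definition of $t^*$ by simple algebra, which the sketch below checks numerically. The function names and the parameter values are illustrative assumptions; the formulas are the ones used in the argument above, with $t^* = \frac{1}{2}\frac{\gamma^2}{C_{\max}}\frac{n}{\psi(B^*)(1+\delta)}$.

```python
import math

def t_star(n, psi, gamma, c_max, delta):
    """t*(n, B*) = (1/2) * (gamma^2 / C_max) * n / (psi * (1 + delta))."""
    return 0.5 * (gamma ** 2 / c_max) * n / (psi * (1 + delta))

def required_sample_size(psi, p, s, gamma, c_max, delta, nu):
    """Boundary value of n in: n > (1+nu) (1+delta)/(1-delta) (C_max/gamma^2) [2 psi log(p-s)]."""
    return ((1 + nu) * (1 + delta) / (1 - delta)
            * (c_max / gamma ** 2) * 2 * psi * math.log(p - s))

# Hypothetical parameter values, chosen only to exercise the formulas.
psi, p, s = 8.0, 256, 16
gamma, c_max, delta, nu = 0.5, 1.0, 0.1, 0.05

n_req = required_sample_size(psi, p, s, gamma, c_max, delta, nu)
lhs = (1 - delta) * t_star(n_req, psi, gamma, c_max, delta)
rhs = (1 + nu) * math.log(p - s)
```

At the boundary value `n_req`, the two sides `lhs` and `rhs` coincide, so any larger $n$ satisfies $(1 - \delta)\,t^* > (1 + \nu)\log(p - s)$.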

4  Discussion 

In this paper, we have analyzed the high-dimensional behavior of block-regularization for multivariate regression problems, and shown that its behavior is governed by the sample complexity parameter

$$\theta_{\ell_1/\ell_2}(n, p, s) := n/[2\psi(B^*)\log(p - s)],$$

where $n$ is the sample size, $p$ is the ambient dimension, and $\psi$ is a sparsity-overlap function that measures a combination of the sparsity and overlap properties of the true regression matrix $B^*$.

There are a number of open questions associated with this work. First, note that the
current paper provides only an achievability condition (i.e., support recovery can be achieved
once the control parameter is larger than some finite critical threshold $t^*$). However, based
both on empirical results (see Figures 2 and 3) and technical aspects of the proof, we
conjecture that our characterization is in fact sharp, meaning that the block-regularized
convex program (7) fails to recover the support once the control parameter drops
below some critical threshold. Indeed, this conjecture is consistent with the special case of
univariate regression with $K = 1$, where it is known (Wainwright, 2006) that the Lasso fails
once the ratio $n/[2s\log(p-s)]$ falls below a critical threshold. Secondly, the current work
applies to the "hard"-sparsity model, in which a subset $S$ of the regressors has non-zero
coefficients, and the remaining coefficients are zero. As with the ordinary Lasso, it would also
be interesting to study block-regularization under soft sparsity models (e.g., $\ell_q$ "balls" for
coefficients, with $q < 1$), or under an alternative loss function such as mean-squared error, as
opposed to the exact support recovery criterion considered here.
A  Proof  of  Lemma  2

Using the notation $\beta_i$ to denote a row of $B$ and denoting by

$$\mathcal{K} := \{(w, v) \in \mathbb{R}^K \times \mathbb{R} \mid \|w\|_2 \le v\} \tag{40}$$




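the usual second-order cone (SOC), we can rewrite the original convex program (7) as

$$\min_{B \in \mathbb{R}^{p \times K},\; b \in \mathbb{R}^p} \; \frac{1}{2n}\,|||Y - XB|||_F^2 + \lambda_n \sum_{i=1}^{p} b_i \qquad \text{s.t.} \quad (\beta_i, b_i) \in \mathcal{K}, \quad 1 \le i \le p.$$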


We now dualize the conic constraints (Boyd and Vandenberghe, 2004), using conic Lagrange
multipliers belonging to the dual cone $\mathcal{K}^* = \{(z,t) \in \mathbb{R}^K \times \mathbb{R} \mid z^T w + vt \ge 0 \text{ for all } (w,v) \in \mathcal{K}\}$.

The second-order cone $\mathcal{K}$ is self-dual (Boyd and Vandenberghe, 2004), so that the convex
program (41) is equivalent to


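$$\min_{B \in \mathbb{R}^{p \times K},\; b \in \mathbb{R}^p} \; \max_{Z \in \mathbb{R}^{p \times K},\; t \in \mathbb{R}^p} \; \frac{1}{2n}\,|||Y - XB|||_F^2 + \lambda_n \sum_{i=1}^{p} b_i - \lambda_n \sum_{i=1}^{p} \left(-z_i^T \beta_i + t_i b_i\right) \qquad \text{s.t.} \quad (z_i, t_i) \in \mathcal{K}, \quad 1 \le i \le p,$$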


where $Z$ is the matrix whose $i$-th row is $z_i$.
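The self-duality invoked above can be probed numerically. The sketch below is an illustration only (the helper names, dimension, and seed are arbitrary assumptions): it draws random pairs of points in the second-order cone and records the smallest value of the pairing $z^T w + vt$, which should never be negative, and it also exhibits one point outside the cone for which the pairing is negative.

```python
import math
import random

def norm2(x):
    return math.sqrt(sum(v * v for v in x))

def random_soc_point(rng, dim):
    """Draw (w, v) with ||w||_2 <= v, i.e. a point of the second-order cone."""
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    v = norm2(w) * (1.0 + rng.random())
    return w, v

def pairing(z, t, w, v):
    """The quantity z^T w + v t from the definition of the dual cone."""
    return sum(a * b for a, b in zip(z, w)) + v * t

def min_pairing(trials=2000, dim=3, seed=0):
    """Smallest pairing observed over random pairs of cone points."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        z, t = random_soc_point(rng, dim)
        w, v = random_soc_point(rng, dim)
        worst = min(worst, pairing(z, t, w, v))
    return worst

# A point outside the cone (||z||_2 = 1 > t = 0.5) paired with a cone point.
outside = pairing([1.0, 0.0, 0.0], 0.5, [-1.0, 0.0, 0.0], 1.0)
print(min_pairing(), outside)
```

Non-negativity of the pairing for cone points is just the Cauchy-Schwarz inequality, $z^T w \ge -\|z\|_2 \|w\|_2 \ge -tv$.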

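Since the original program is convex and strictly feasible, strong duality holds and any
pair of primal $(B^*, b^*)$ and dual $(Z^*, t^*)$ solutions has to satisfy the Karush-Kuhn-Tucker
conditions:

$$\|\beta_i^*\|_2 \le b_i^*, \quad 1 \le i \le p \tag{41a}$$

$$\|z_i^*\|_2 \le t_i^*, \quad 1 \le i \le p \tag{41b}$$

$$z_i^{*T} \beta_i^* - t_i^* b_i^* = 0, \quad 1 \le i \le p \tag{41c}$$

$$\nabla_B \left[ \frac{1}{2n}\,|||Y - XB|||_F^2 \right]_{B = B^*} + \lambda_n Z^* = 0 \tag{41d}$$

$$\lambda_n (1 - t_i^*) = 0 \tag{41e}$$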


Since equations (41c) and (41e) impose the constraints $t_i^* = 1$ and $b_i^* = \|\beta_i^*\|_2$, a primal-dual
solution to this conic program is determined by $(B^*, Z^*)$.
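For a non-zero row, the conic variables implied by these conditions can be written down explicitly. The following sketch uses a hypothetical helper `conic_pair` and checks conditions (41a), (41b), (41c) and (41e) on one example row; it is a sanity check on the algebra, not part of the proof.

```python
import math

def norm2(x):
    return math.sqrt(sum(v * v for v in x))

def conic_pair(beta_i):
    """For a non-zero row beta_i, return (b_i, z_i, t_i) with
    t_i = 1, b_i = ||beta_i||_2 and z_i = beta_i / ||beta_i||_2."""
    b = norm2(beta_i)
    if b == 0.0:
        raise ValueError("row must be non-zero; zero rows admit any z with ||z||_2 <= 1")
    z = [v / b for v in beta_i]
    return b, z, 1.0

beta = [3.0, -4.0]
b, z, t = conic_pair(beta)

primal_feasible = norm2(beta) <= b + 1e-12                           # (41a)
dual_feasible = norm2(z) <= t + 1e-12                                # (41b)
slackness = sum(zi * bi for zi, bi in zip(z, beta)) - t * b          # (41c)
stationarity_in_b = 1.0 - t                                          # (41e), divided by lambda_n
print(primal_feasible, dual_feasible, slackness, stationarity_in_b)
```

With $t_i = 1$, the chain $b_i = z_i^T \beta_i \le \|\beta_i\|_2 \le b_i$ forces $b_i = \|\beta_i\|_2$, which is what the helper encodes.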

Any solution satisfying the conditions in Lemma 2 also satisfies these KKT conditions,
since equation (28b) and the definition (28c) are equivalent to equation (41d), and equation
(28a) and the combination of conditions (28d) and (28c) imply that the complementary
slackness equations (41c) hold for each primal-dual conic pair $(\beta_i, z_i)$.

Now consider some other primal solution $\widetilde{B}$; when combined with the optimal dual
solution $\widehat{Z}$, the pair $(\widetilde{B}, \widehat{Z})$ must satisfy the KKT conditions (Bertsekas, 1995). But since
for $j \in S^c$, we have $\|\widehat{z}_j\|_2 < 1$, the complementary slackness condition (41c) implies
that for all $j \in S^c$, $\widetilde{\beta}_j = 0$. This fact in turn implies that the primal solution $\widetilde{B}$ must also be
a solution to the restricted convex program (29), obtained by only considering the covariates
in the set $S$ or equivalently by setting $B_{S^c} = 0_{S^c}$. But since $s < n$ by assumption, the matrix
$X_S^T X_S$ is strictly positive definite with probability one, and therefore the restricted convex
program (29) has a unique solution $\widetilde{B}_S = \widehat{B}_S$. We have thus shown that a solution $(\widehat{B}, \widehat{Z})$
to the program (7) that satisfies the conditions of Lemma 2, if it exists, must be unique.
B  Inequalities  with  block-matrix  norms


In general, the two families of matrix norms that we have introduced, $\|\cdot\|_{\ell_a/\ell_b}$ and $|||\cdot|||_{p,q}$,
are distinct, but they coincide in the following useful special case:

Lemma 5. For $1 \le p \le \infty$ and for $r$ defined by $1/r + 1/p = 1$ we have

$$\|\cdot\|_{\ell_\infty/\ell_p} = |||\cdot|||_{\infty,r}.$$

Proof. Indeed, if $a_i$ denotes the $i$-th row of $A$, then

$$\|A\|_{\ell_\infty/\ell_p} = \max_i \|a_i\|_p = \max_i \max_{\|y_i\|_r \le 1} y_i^T a_i = \max_{\|y\|_r \le 1} \max_i |y^T a_i| = \max_{\|y\|_r \le 1} \|Ay\|_\infty = |||A|||_{\infty,r}.$$
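The identity in Lemma 5 can be checked numerically in the self-conjugate case $p = r = 2$. The sketch below is an illustration under that assumption (matrix size, seed, and function names are arbitrary): it compares the largest row norm of a random matrix with the value of $\|Ay\|_\infty$ at the maximizing unit vector, and confirms that randomly sampled unit vectors never exceed it.

```python
import math
import random

def norm2(x):
    return math.sqrt(sum(v * v for v in x))

def block_norm_inf_2(a):
    """l_inf/l_2 block norm: the largest Euclidean norm of a row."""
    return max(norm2(row) for row in a)

def sup_norm_of_product(a, y):
    """||A y||_inf for a matrix given as a list of rows."""
    return max(abs(sum(r * v for r, v in zip(row, y))) for row in a)

def compare(rows=5, cols=3, trials=2000, seed=1):
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    block = block_norm_inf_2(a)
    # The maximizer over the unit l_2 ball is the normalized largest row.
    top = max(a, key=norm2)
    y_star = [v / norm2(top) for v in top]
    attained = sup_norm_of_product(a, y_star)
    # Random unit vectors give lower bounds on the operator norm.
    sampled = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        ng = norm2(g)
        sampled = max(sampled, sup_norm_of_product(a, [v / ng for v in g]))
    return block, attained, sampled

block, attained, sampled = compare()
print(block, attained, sampled)
```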


We  conclude  by  stating  some  useful  bounds  and  relations: 

Lemma 6. Consider matrices $A$ and $Z$ of compatible dimensions (so that the product $AZ$ is defined) and $p, r > 0$ with $\frac{1}{p} + \frac{1}{r} = 1$. We
have:


\\AZ\\,^/,^  =  |||^Z|||oo,.< 


< 

r  — 


r,  cxD 


—  A/p 

OO,  r  —  ^ 


|||'^|||oo,r  — 

l^/h 


IZI 


toAip  ■ 


(42a) 

(42b) 


C  Some  concentration  inequalities  for  random  matrices 


In  this  appendix,  we  state  some  known  concentration  inequalities  for  the  extreme  eigenvalues 
of  Gaussian  random  matrices  (Davidson  and  Szarek,  2001).  Although  these  results  hold 
more  generally,  our  interest  here  is  on  scalings  (n,  s)  such  that  s/n  ^  0. 

Lemma 7. Let $U \in \mathbb{R}^{n \times s}$ be a random matrix from the standard Gaussian ensemble (i.e.,
$U_{ij} \sim N(0,1)$, i.i.d.). Then


1  r 

-U^U-Is 

n 


<  2exp(— cn)  ^  0. 


(43) 


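This result is adapted easily to more general Gaussian ensembles. Letting $X = U\sqrt{\Lambda}$,
we obtain an $n \times s$ matrix with i.i.d. rows, $X_i \sim N(0, \Lambda)$. If the covariance matrix $\Lambda$ has
maximum eigenvalue $C_{\max} < +\infty$, then we have

$$|||n^{-1} X^T X - \Lambda|||_2 = |||\sqrt{\Lambda}\,[n^{-1} U^T U - I]\,\sqrt{\Lambda}|||_2 \le C_{\max}\,|||n^{-1} U^T U - I|||_2, \tag{44}$$

so that the bound (43) immediately yields an analogous bound with different constants.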
The final type of bound that we require is on the difference

$$|||(X^T X/n)^{-1} - \Lambda^{-1}|||_2,$$

assuming that $X^T X$ is invertible. We note that

$$|||(X^T X/n)^{-1} - \Lambda^{-1}|||_2 = |||(X^T X/n)^{-1}\,[\Lambda - (X^T X/n)]\,\Lambda^{-1}|||_2 \le |||(X^T X/n)^{-1}|||_2\; |||\Lambda - (X^T X/n)|||_2\; |||\Lambda^{-1}|||_2.$$

As long as the eigenvalues of $\Lambda$ are bounded below by $C_{\min} > 0$, then $|||\Lambda^{-1}|||_2 \le 1/C_{\min}$.
Moreover, since $s/n \to 0$, we have (from equation (44)) that $|||(X^T X/n)^{-1}|||_2 \le 2/C_{\min}$
with probability converging to one exponentially in $n$. Thus, equation (44) implies the
desired bound.
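The algebraic identity and the product bound used above can be illustrated on a small example. The sketch below makes several assumptions for convenience: it works with $s = 2$, a hypothetical covariance `LAMBDA`, and a Cholesky factor in place of the symmetric square root $\sqrt{\Lambda}$ (both produce rows distributed as $N(0, \Lambda)$). All helper names are illustrative.

```python
import math
import random

def inv2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def sub2(x, y):
    return [[x[i][j] - y[i][j] for j in range(2)] for i in range(2)]

def spec_sym2(m):
    """Spectral norm of a symmetric 2x2 matrix via its two eigenvalues."""
    mean = 0.5 * (m[0][0] + m[1][1])
    rad = math.hypot(0.5 * (m[0][0] - m[1][1]), m[0][1])
    return max(abs(mean + rad), abs(mean - rad))

def sample_covariance(n, lam, rng):
    """X^T X / n for n i.i.d. rows x = L u with u ~ N(0, I_2) and L L^T = lam."""
    l00 = math.sqrt(lam[0][0])
    l10 = lam[1][0] / l00
    l11 = math.sqrt(lam[1][1] - l10 * l10)
    s00 = s01 = s11 = 0.0
    for _ in range(n):
        u0, u1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x0, x1 = l00 * u0, l10 * u0 + l11 * u1
        s00 += x0 * x0
        s01 += x0 * x1
        s11 += x1 * x1
    return [[s00 / n, s01 / n], [s01 / n, s11 / n]]

LAMBDA = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical covariance, chosen for illustration

def check(n, seed=0):
    rng = random.Random(seed)
    sigma_hat = sample_covariance(n, LAMBDA, rng)
    lhs = sub2(inv2(sigma_hat), inv2(LAMBDA))
    rhs = mul2(mul2(inv2(sigma_hat), sub2(LAMBDA, sigma_hat)), inv2(LAMBDA))
    identity_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
    deviation = spec_sym2(sub2(LAMBDA, sigma_hat))
    bound = spec_sym2(inv2(sigma_hat)) * deviation * spec_sym2(inv2(LAMBDA))
    return identity_gap, spec_sym2(lhs), bound, deviation

for n in (50, 20000):
    print(n, check(n))
```

The deviation of the sample covariance, and with it the bound on the inverse difference, shrinks as $n$ grows with $s$ fixed, in line with the regime $s/n \to 0$.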


D  Proof  of  Lemma  3 


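From the previous section, the condition $\|\Delta_i\|_2 \le 1/2$ implies that $\widehat{\beta}_i \neq 0$ and hence
$\widehat{Z}_i = \widehat{\beta}_i/\|\widehat{\beta}_i\|_2$ for all rows $i \in S$. Therefore, using the notation $Z_i^* = \beta_i^*/\|\beta_i^*\|_2$ we have

$$\widehat{Z}_i - Z_i^* = \frac{\widehat{\beta}_i}{\|\widehat{\beta}_i\|_2} - Z_i^* = \frac{Z_i^* + \Delta_i}{\|Z_i^* + \Delta_i\|_2} - Z_i^* = Z_i^* \left[ \frac{1}{\|Z_i^* + \Delta_i\|_2} - 1 \right] + \frac{\Delta_i}{\|Z_i^* + \Delta_i\|_2}.$$

Note that, for $z \neq 0$, $g(z, \delta) = \frac{1}{\|z + \delta\|_2}$ is differentiable with respect to $\delta$, with gradient
$\nabla_\delta\, g(z, \delta) = -\frac{z + \delta}{2\|z + \delta\|_2^3}$. By the mean-value theorem, there exists $h \in [0, 1]$ such that

$$\frac{1}{\|z + \delta\|_2} - 1 = g(z, \delta) - g(z, 0) = \nabla_\delta\, g(z, h\delta)^T \delta = -\frac{(z + h\delta)^T \delta}{2\|z + h\delta\|_2^3},$$

which implies that there exists $h_i \in [0, 1]$ such that

$$\|\widehat{Z}_i - Z_i^*\|_2 \le \|Z_i^*\|_2\, \frac{|(Z_i^* + h_i \Delta_i)^T \Delta_i|}{2\|Z_i^* + h_i \Delta_i\|_2^3} + \frac{\|\Delta_i\|_2}{\|Z_i^* + \Delta_i\|_2} \le \frac{\|\Delta_i\|_2}{2\|Z_i^* + h_i \Delta_i\|_2^2} + \frac{\|\Delta_i\|_2}{\|Z_i^* + \Delta_i\|_2}. \tag{45}$$

We note that $\|Z_i^*\|_2 = 1$ and $\|\Delta_i\|_2 \le \frac{1}{2}$ imply that $\|Z_i^* + h_i \Delta_i\|_2 \ge \frac{1}{2}$. Combined with
inequality (45), we obtain $\|\widehat{Z}_i - Z_i^*\|_2 \le 4\|\Delta_i\|_2$, which proves the lemma.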
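The final inequality of Lemma 3 can be cross-checked numerically, independently of the mean-value step: the reverse triangle inequality gives $|1/\|Z_i^* + \Delta_i\|_2 - 1| \le \|\Delta_i\|_2 / \|Z_i^* + \Delta_i\|_2 \le 2\|\Delta_i\|_2$, which leads to the same constant 4. The sketch below is an illustration only (dimension, seed, and the range of perturbation sizes are arbitrary assumptions): it samples unit vectors and perturbations of norm at most 1/2 and records the largest observed ratio between the change in the normalized vector and the perturbation size.

```python
import math
import random

def norm2(x):
    return math.sqrt(sum(v * v for v in x))

def normalized_gap(z_star, delta):
    """|| (z* + delta)/||z* + delta||_2 - z* ||_2 for a unit vector z*."""
    shifted = [a + b for a, b in zip(z_star, delta)]
    ns = norm2(shifted)
    z_hat = [v / ns for v in shifted]
    return norm2([a - b for a, b in zip(z_hat, z_star)])

def random_unit(rng, dim):
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    ng = norm2(g)
    return [v / ng for v in g]

def worst_ratio(trials=5000, dim=4, seed=0):
    """Largest observed gap / ||delta||_2 over perturbations with 0.01 <= ||delta||_2 <= 0.5."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        z_star = random_unit(rng, dim)
        radius = 0.01 + 0.49 * rng.random()
        delta = [radius * v for v in random_unit(rng, dim)]
        worst = max(worst, normalized_gap(z_star, delta) / radius)
    return worst

print(worst_ratio())
```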


E  Proof  of  Lemma  4 

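With $Z_S^* = \zeta(B_S^*)$, define the $K \times K$ random matrix

$$M^* := \frac{\lambda_n^2}{n}\,(Z_S^*)^T (\widehat{\Sigma}_{SS})^{-1} Z_S^* + \frac{1}{n^2}\,W^T (I_n - \Pi_S) W$$

and note that (using standard results on Wishart matrices (Anderson, 1984))

$$\mathbb{E}[M^*] = \frac{\lambda_n^2}{n}\,\frac{n}{n - s - 1}\,(Z_S^*)^T (\Sigma_{SS})^{-1} Z_S^* + \sigma^2\,\frac{n - s}{n^2}\, I_K. \tag{46}$$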
To bound $M_n$ in spectral norm, we use the triangle inequality:

$$|||M_n|||_2 \le \underbrace{|||M_n - M^*|||_2}_{A_1} + \underbrace{|||M^* - \mathbb{E}[M^*]|||_2}_{A_2} + \underbrace{|||\mathbb{E}[M^*]|||_2}_{A_3}. \tag{47}$$

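Considering the term $A_1$ in the decomposition (47), we have:

$$|||M_n - M^*|||_2 = \frac{\lambda_n^2}{n}\, |||(Z_S^*)^T \widehat{\Sigma}_{SS}^{-1} Z_S^* - \widehat{Z}_S^T \widehat{\Sigma}_{SS}^{-1} \widehat{Z}_S|||_2$$

$$= \frac{\lambda_n^2}{n}\, |||(Z_S^*)^T \widehat{\Sigma}_{SS}^{-1} (Z_S^* - \widehat{Z}_S) + (Z_S^* - \widehat{Z}_S)^T \widehat{\Sigma}_{SS}^{-1} \big(Z_S^* + (\widehat{Z}_S - Z_S^*)\big)|||_2$$

$$\le \frac{\lambda_n^2}{n}\, |||\widehat{\Sigma}_{SS}^{-1}|||_2\; |||Z_S^* - \widehat{Z}_S|||_2 \left( 2\,|||Z_S^*|||_2 + |||Z_S^* - \widehat{Z}_S|||_2 \right). \tag{48}$$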


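Using the concentration results on random matrices in Appendix C, we have the bound
$|||\widehat{\Sigma}_{SS}^{-1}|||_2 \le 2/C_{\min}$ with probability greater than $1 - 2\exp(-cn)$, and we have $|||Z_S^*|||_2 = O(\sqrt{s})$
by definition. Moreover, from equation (42b) in Lemma 6, we have
$|||Z_S^* - \widehat{Z}_S|||_2 \le \sqrt{s}\, \|Z_S^* - \widehat{Z}_S\|_{\ell_\infty/\ell_2}$. Using the bound (34) and Lemma 3, we have

$$\|Z_S^* - \widehat{Z}_S\|_{\ell_\infty/\ell_2} \le 4\,\|\Delta_S\|_{\ell_\infty/\ell_2} = o(1)$$

with probability greater than $1 - 2\exp(-c \log s)$, so that from equation (48), we conclude
that

$$A_1 = |||M^* - M_n|||_2 = o\!\left(\frac{\lambda_n^2 s}{n}\right) \quad \text{w.h.p.} \tag{49}$$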


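Turning to term $A_2$, we have the upper bound $A_2 \le T_1 + T_2$, where

$$T_1 := \frac{\lambda_n^2}{n}\, |||Z_S^*|||_2^2\; |||(\widehat{\Sigma}_{SS})^{-1} - \frac{n}{n - s - 1}\,(\Sigma_{SS})^{-1}|||_2, \qquad T_2 := \frac{1}{n^2}\, |||W^T (I_n - \Pi_S) W - \sigma^2 (n - s) I_K|||_2.$$

We have $T_1 = o\!\left(\frac{\lambda_n^2 s}{n}\right)$ with probability greater than $1 - 2\exp(-cn)$, since $|||Z_S^*|||_2^2 \le s$, and
$|||(\widehat{\Sigma}_{SS})^{-1} - (\Sigma_{SS})^{-1}|||_2 = o(1)$ with high probability (see Appendix C). Turning to
$T_2$, we have with probability greater than $1 - 2\exp(-cn)$,

$$\frac{1}{n^2}\, |||W^T (I_n - \Pi_S) W - \sigma^2 (n - s) I_K|||_2 = o\!\left(\frac{1}{n}\right) = o\!\left(\frac{\lambda_n^2 s}{n}\right),$$

since $\lambda_n^2 s \to +\infty$. Overall, we conclude that

$$A_2 = |||M^* - \mathbb{E}[M^*]|||_2 = o\!\left(\frac{\lambda_n^2 s}{n}\right) \quad \text{w.h.p.} \tag{50}$$

Finally, turning to $A_3 = |||\mathbb{E}[M^*]|||_2$, from equation (46), we have

$$|||\mathbb{E}[M^*]|||_2 \le \frac{\lambda_n^2}{n}\,\frac{n}{n - s - 1}\, |||(Z_S^*)^T (\Sigma_{SS})^{-1} Z_S^*|||_2 + \frac{\sigma^2}{n}\left(1 - \frac{s}{n}\right) = (1 + o(1))\, \frac{\lambda_n^2\, \psi(B^*) + \sigma^2}{n}. \tag{51}$$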

Combining bounds (49), (50), and (51) in the decomposition (47), and using the fact
that $\psi(B^*) = \Theta(s)$ (see Lemma 1(a)) yields that

$$|||M_n|||_2 \le (1 + o(1))\, \frac{\lambda_n^2\, \psi(B^*) + \sigma^2}{n}$$

with probability greater than $1 - 2\exp(-c \log s)$, which establishes the claim.


F  Large  deviations  for  x^-variates 

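Lemma 8. Let $Z_1, \ldots, Z_m$ be i.i.d. $\chi^2$-variates with $d$ degrees of freedom. Then for all
$t > d$, we have

$$\mathbb{P}\Big[\max_{i = 1, \ldots, m} Z_i \ge 2t\Big] \le m \exp\!\big(-t + 2\sqrt{td}\big). \tag{52}$$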

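Proof. Given a central $\chi^2$-variate $X$ with $d$ degrees of freedom, Laurent and Massart (1998)
prove that $\mathbb{P}[X - d \ge 2\sqrt{dx} + 2x] \le \exp(-x)$, or equivalently

$$\mathbb{P}\big[X \ge x + (\sqrt{x} + \sqrt{d})^2\big] \le \exp(-x),$$

valid for all $x > 0$. Setting $\sqrt{x} + \sqrt{d} = \sqrt{t}$, we have

$$\mathbb{P}[X \ge 2t] \overset{(a)}{\le} \mathbb{P}\big[X \ge (\sqrt{t} - \sqrt{d})^2 + t\big] \le \exp\!\big(-(\sqrt{t} - \sqrt{d})^2\big) \le \exp\!\big(-t + 2\sqrt{td}\big),$$

where inequality (a) follows since $\sqrt{t} \ge \sqrt{d}$ by assumption. Thus, the claim (52) follows by
the union bound. □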
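The tail bound can be compared against exact probabilities when $d$ is even, since the $\chi^2$ survival function then has a finite closed form, $\mathbb{P}[X \ge x] = e^{-x/2} \sum_{k=0}^{d/2-1} (x/2)^k / k!$. The sketch below relies on that standard formula; the function names and the grid of $(d, t, m)$ values are illustrative assumptions, not taken from the paper.

```python
import math

def chi2_sf_even(x, d):
    """Exact P[X >= x] for a chi-square variate with an even number d of degrees of freedom."""
    if d <= 0 or d % 2:
        raise ValueError("d must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, d // 2):
        term *= half / k          # (x/2)^k / k!
        total += term
    return math.exp(-half) * total

def max_tail(t, d, m):
    """Exact P[max_i Z_i >= 2t] for m i.i.d. chi-square variates."""
    return 1.0 - (1.0 - chi2_sf_even(2.0 * t, d)) ** m

def lemma8_bound(t, d, m):
    """Right-hand side m * exp(-t + 2 sqrt(t d)), valid for t > d."""
    return m * math.exp(-t + 2.0 * math.sqrt(t * d))

for d in (2, 4, 10):
    for t in (d + 0.5, 5 * d, 20 * d):
        for m in (1, 100):
            print(d, t, m, max_tail(t, d, m), lemma8_bound(t, d, m))
```

The bound is vacuous (larger than one) for $t$ close to $d$ and becomes informative once $t$ exceeds roughly $4d$, where the exponent $-t + 2\sqrt{td}$ turns negative.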